Standardised Versioning of Datasets: a FAIR-compliant Proposal
- PMID: 38594314
- PMCID: PMC11003959
- DOI: 10.1038/s41597-024-03153-y
Abstract
This paper presents a standardised dataset versioning framework that improves the reusability, recognition and version tracking of data, and that supports comparisons and informed decisions about data usability and workflow integration. The framework adopts a versioning nomenclature modelled on software engineering ("major.minor.patch") and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_{E,PCA} and d_{E,AE}) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_{E,PCA} metric, which combines PCA models with splines. It is computationally efficient, yields values below 50 for new dataset batches, and yields values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables, whereas information loss is handled efficiently and yields values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
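To make the idea of a drift-driven version bump concrete, the following is a minimal sketch and not the authors' metric. It assumes a simplified drift score: the relative increase in PCA reconstruction error of a new batch against a reference dataset, clipped to the 0 to 100 range. It omits the spline component that the abstract attributes to the PCA-based metric. The function names (`fit_pca`, `reconstruction_error`, `drift_score`, `bump_version`), the clipping rule, and the `minor_threshold` value are all hypothetical; only the "major.minor.patch" nomenclature and the association of a value of 100 with major updates come from the abstract.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on standardised reference data via SVD."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, vt[:n_components]


def reconstruction_error(X, mean, std, components):
    """Mean squared error after projecting onto the PCA subspace and back."""
    Z = (X - mean) / std
    Z_hat = Z @ components.T @ components
    return float(np.mean((Z - Z_hat) ** 2))


def drift_score(X_ref, X_new, n_components=2):
    """Hypothetical drift score in [0, 100] (an assumption, not the paper's formula).

    The PCA model is fitted on the reference data only; the score is the
    relative increase in reconstruction error on the new batch, clipped at 100.
    """
    mean, std, comps = fit_pca(X_ref, n_components)
    e_ref = reconstruction_error(X_ref, mean, std, comps)
    e_new = reconstruction_error(X_new, mean, std, comps)
    rel = max(e_new - e_ref, 0.0) / (e_ref + 1e-12)
    return 100.0 * min(rel, 1.0)


def bump_version(version, score, minor_threshold=10.0, major_threshold=100.0):
    """Map a drift score to a "major.minor.patch" bump.

    minor_threshold is a placeholder chosen for illustration only.
    """
    major, minor, patch = (int(p) for p in version.split("."))
    if score >= major_threshold:
        return f"{major + 1}.0.0"
    if score >= minor_threshold:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

A caller would fit on the currently released dataset, score each incoming batch with `drift_score`, and pass the result to `bump_version` to obtain the next version string. In this sketch, rescaling a subset of columns pushes the score to the upper bound, while a batch drawn from the same distribution stays near the lower end.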
© 2024. The Author(s).
Conflict of interest statement
The authors declare no competing interests.