Sci Data. 2024 Apr 9;11(1):358.
doi: 10.1038/s41597-024-03153-y.

Standardised Versioning of Datasets: a FAIR-compliant Proposal

Alba González-Cebrián et al. Sci Data. 2024.

Abstract

This paper presents a standardised dataset versioning framework that improves the reusability, recognition and version tracking of data, making it easier to compare versions and to make informed decisions about data usability and workflow integration. The framework adopts a versioning nomenclature borrowed from software engineering ("major.minor.patch") and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_{E,PCA} and d_{E,AE}), based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders), are evaluated for dataset creation, update, and deletion. The optimal choice is the d_{E,PCA} metric, which combines PCA models with splines. It is computationally efficient, takes values below 50 for new dataset batches, and yields values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables, while information loss is handled efficiently, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
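
As an illustration of how a drift value could drive a "major.minor.patch" label, the following Python sketch may help. It assumes the drift metric lies in the range 0 to 100 and that a value of 100 signals a major update, as the abstract states. The `Version` class, the `bump` function and the cut-off of 50 between minor and patch updates are hypothetical choices made for this example, not rules taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """A "major.minor.patch" dataset version label."""
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def bump(version: Version, drift: float,
         minor_threshold: float = 50.0,
         major_threshold: float = 100.0) -> Version:
    """Return the next version for a revision with the given drift value.

    Assumption: drift lies in [0, 100]. Both thresholds are illustrative
    defaults; only "100 means a major update" comes from the abstract.
    """
    if not 0.0 <= drift <= 100.0:
        raise ValueError("drift is assumed to lie in [0, 100]")
    if drift >= major_threshold:
        # Major update: reset the lower-order counters.
        return Version(version.major + 1, 0, 0)
    if drift >= minor_threshold:
        # Hypothetical rule: sizeable but not maximal drift is a minor update.
        return Version(version.major, version.minor + 1, 0)
    # Hypothetical rule: small drift is recorded as a patch.
    return Version(version.major, version.minor, version.patch + 1)


if __name__ == "__main__":
    current = Version(1, 2, 3)
    for d in (10.0, 60.0, 100.0):
        print(d, "->", bump(current, d))
```

Keeping the thresholds as parameters leaves the mapping adjustable, since the appropriate cut-offs would depend on the dataset and on the drift metric chosen.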

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Flowchart summarising our approach to quantifying data drift using several strategies. Each strategy employs an ML model and an associated data drift metric. First, for a Primary Source dataset, we build ML models and predictive models based on the Mean Squared Error (MSE models); together these steps make up the Model Building Phase. Similarly, for a Revision dataset, i.e., a new dataset version, the corresponding ML models are built. These models are used during the Model Exploitation Phase to compute the associated data drift metric (e.g., d_{E,AE}, d_{E,PCA}, and d_P).
Fig. 2
Values of the R² coefficients of each ML model (a) and of the computation time required to estimate all the elements required for each of the data drift options (b) for each of the PS datasets (x-axis).
Fig. 3
Values of the metrics (y-axis) when new batches of different sizes (x-axis) were added to the Primary Source dataset.
Fig. 4
Values of the metrics (y-axis) when they were computed on Revisions with different percentages of variables (x-axis) transformed to a different scale with a cubic root transformation.
Fig. 5
Values of the metrics (y-axis) when they were computed on Revisions obtained by down-sampling the original time series, reducing the sample size (x-axis) of the resulting new versions.
Fig. 6
Values of the metrics (y-axis) when computed on batches of 10% of the Revision subset for each of the datasets, except for DS 04, where batches of 25% were used.
Fig. 7
Examples of values of the trend component for variables referring to continents from DS 03, used for the PS (black) and the Revision batches (red).
Fig. 8
Four examples of increasing trend components for variables from DS 07, used for the PS (black) and the Revision batches (red).
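
The two phases summarised in Fig. 1 can be illustrated with a minimal Python sketch: a PCA model is fitted on the Primary Source (Model Building Phase) and then used to measure how much worse it reconstructs a Revision (Model Exploitation Phase). This is a simplified stand-in, not the d_{E,PCA} metric from the paper, which also involves splines. The function names and the mapping of the reconstruction errors onto a 0 to 100 scale are assumptions made for this example.

```python
import numpy as np


def fit_pca(X: np.ndarray, n_components: int):
    """Model Building Phase (sketch): standardise X and keep the leading loadings."""
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / scale
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return mean, scale, vt[:n_components].T  # loadings: (n_vars, n_components)


def reconstruction_mse(X: np.ndarray, model) -> float:
    """Mean squared reconstruction error of X under a fitted PCA model."""
    mean, scale, loadings = model
    Z = (X - mean) / scale
    residual = Z - Z @ loadings @ loadings.T
    return float(np.mean(residual ** 2))


def drift_score(primary: np.ndarray, revision: np.ndarray,
                n_components: int = 2) -> float:
    """Model Exploitation Phase (sketch): hypothetical drift value in [0, 100].

    Assumed mapping: 0 when the Revision is reconstructed at least as well as
    the Primary Source, approaching 100 as its error grows much larger.
    """
    model = fit_pca(primary, n_components)
    mse_ps = reconstruction_mse(primary, model)
    mse_rev = reconstruction_mse(revision, model)
    if mse_rev <= mse_ps:
        return 0.0
    return 100.0 * (1.0 - mse_ps / mse_rev)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))
    mixing = rng.normal(size=(2, 5))
    primary = latent @ mixing + 0.01 * rng.normal(size=(200, 5))
    print("same data:", drift_score(primary, primary))
    print("cube-root rescaled:", drift_score(primary, np.cbrt(primary)))
```

The cube-root rescaling in the demo mirrors the kind of transformation described for Fig. 4. A score built this way reacts to changes in scale and correlation structure, but it would need calibration before its values could be compared with those reported in the figures.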
