Sci Data. 2019 Feb 19;6:190021.
doi: 10.1038/sdata.2019.21.

The variable quality of metadata about biological samples used in biomedical experiments

Rafael S Gonçalves et al. Sci Data. 2019.

Abstract

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 million sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered that there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
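
The kind of check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the accepted boolean spellings, the function names, and the name-normalization rule (a simple stand-in for the clustering of field names mentioned above) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical checkers: the study's actual validation rules are not given here.
def is_boolean(value: str) -> bool:
    # Assumed set of acceptable spellings for a binary field.
    return value.strip().lower() in {"true", "false", "yes", "no", "0", "1"}

def is_integer(value: str) -> bool:
    # An optionally signed run of digits, with nothing else in the value.
    return re.fullmatch(r"[+-]?\d+", value.strip()) is not None

CHECKERS = {"boolean": is_boolean, "integer": is_integer}

def classify(value: str, expected_type: str) -> str:
    """Label a value as 'well-specified' or 'invalid' for its declared type."""
    return "well-specified" if CHECKERS[expected_type](value) else "invalid"

def normalize_field_name(name: str) -> str:
    # Assumed normalization: lowercase and drop every non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def group_field_names(names):
    """Group field names that collapse to the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_field_name(name)].append(name)
    return dict(groups)
```

Under these assumptions, `classify("not collected", "boolean")` is labeled invalid, and `group_field_names(["Cell Line", "cell_line", "cell-line"])` places all three spellings in a single group, which mirrors the observation that one aspect of a sample is often represented in several distinct ways.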

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Example metadata record from the NCBI BioSample.
An NCBI BioSample metadata record has a title, potentially multiple identifiers associated with it, an organism, a package specification (explained in Section 2.1), multiple attributes in the form of name-value pairs, a description with keywords associated with it, information about the record submitter, and finally accession details.
Figure 2. Mention of metadata packages in NCBI BioSample.
The chart shows the package names followed by the number (and percentage) of metadata records that use that package. The Generic package does not specify any required or optional attributes.
Figure 3. Metadata submissions to NCBI BioSample from 2009–2017.
The columns represent the total number of metadata record submissions to NCBI BioSample in a year, split between Generic and non-Generic records. The Non-Generic metadata records column contains data labels with the absolute number of records. Generic records make up nearly all the submissions in the early years of BioSample, and the bulk of the submissions even in recent years.
Figure 4. Quality of dictionary attributes in NCBI BioSample according to their type.
The columns show the number and percentage of attributes whose values are well-specified or invalid.
Figure 5. Quality of attributes in packaged metadata records in NCBI BioSample.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
Figure 6. Quality of attributes in metadata that co-exist in EBI and NCBI repositories.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
Figure 7. Metadata submissions to EBI BioSamples from 2009–2017.
The columns represent the total number of metadata record submissions to EBI BioSamples per year.
Figure 8. Mention of metadata packages in EBI BioSamples.
The chart shows the package names (or “Unpackaged” for records that do not specify a package) followed by the number and percentage of metadata records that specify that package name.
Figure 9. Quality of named attributes in EBI BioSamples.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
