BMC Res Notes. 2021 May 17;14(1):189.
doi: 10.1186/s13104-021-05605-9.

Shortcomings of SARS-CoV-2 genomic metadata

Landen Gozashti et al. BMC Res Notes.

Abstract

Objective: The SARS-CoV-2 pandemic has prompted one of the most extensive and expeditious genomic sequencing efforts in history. Each viral genome is accompanied by a set of metadata which supplies important information such as the geographic origin of the sample, age of the host, and the lab at which the sample was sequenced, and is integral to epidemiological efforts and public health direction. Here, we interrogate some shortcomings of metadata within the GISAID database to raise awareness of common errors and inconsistencies that may affect data-driven analyses and provide possible avenues for resolutions.

Results: Our analysis reveals a startling prevalence of spelling errors and inconsistent naming conventions, which together occur in an estimated ~9.8% and ~11.6% of "originating lab" and "submitting lab" GISAID metadata entries, respectively. We also find numerous ambiguous entries that provide very little information about the actual source of a sample and could plausibly correspond to multiple sources worldwide. Importantly, all of these issues can impair the accuracy of association studies by deceptively causing a group of samples to appear to come from multiple sources when they truly all come from one source, or vice versa.
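To illustrate the kind of problem described above, the following is a minimal sketch of how misspelled or inconsistently formatted lab names might be grouped together. It is not the authors' method: the function names `normalize` and `group_variants`, the similarity threshold of 0.9, and the example lab names are all assumptions made for illustration, and the fuzzy matching relies only on Python's standard `difflib` module.

```python
import re
import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def group_variants(names, threshold=0.9):
    """Greedily group spellings whose normalized forms are near-identical.

    The threshold is a hypothetical choice, not a value from the paper.
    Each group starts with the most frequent spelling seen so far.
    """
    counts = Counter(names)
    groups = []
    for name, _ in counts.most_common():
        key = normalize(name)
        for group in groups:
            representative = normalize(group[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical "originating lab" entries, invented for this example.
labs = [
    "Public Health Laboratory",
    "Public Health Laboratory",
    "Public Helath Laboratory",
    "public health laboratory.",
    "Regional Virology Unit",
]
groups = group_variants(labs)
for group in groups:
    print(group)
```

In this sketch the three spellings of the first lab fall into one group, so samples that would otherwise appear to come from three sources are counted under one. A fixed similarity threshold can also merge labs that are genuinely distinct but similarly named, so any grouping of this kind would need manual review.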

Keywords: COVID-19; Data quality; Databases; Genomics; Metadata; SARS-CoV-2.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The number of samples produced by each (a) “originating lab” and (b) “submitting lab” and the corresponding number of errors (or inconsistencies) for that respective lab. Color encodes the respective number of data points at a given position on the plot, with positions with fewer points shaded blue and positions with more points shaded red. (c) Some observed examples of misspellings, inconsistent naming conventions, and highly ambiguous entries. (d) A hypothetical phylogenetic tree displaying an example of a case in which errors in “originating lab” metadata might impede association studies with regard to SARS-CoV-2 genomic data. We denote true mutations with black dots and ambiguous mutations with red dots on the phylogeny. In this case, ambiguous “N” alleles occur multiple times across a phylogeny at a given site and all stem from the same lab. Metadata errors (shown in red) cause this ambiguous “N” allele to appear as if it is associated with 4 different labs (rather than 1). Such a site could impair phylogenetic inference and should be flagged in the SARS-CoV-2 masking recommendations but could be overlooked as a result of these errors [20, 24].

References

    1. Goble C, Corcho O, Alper P, De Roure D. e-Science and the semantic web: a symbiotic relationship. In: Discovery science. Berlin, Heidelberg: Springer; 2006. pp. 1–12.
    2. Matters MD, Lekiachvili A, Savel T, Zheng Z-J. Developing metadata to organize public health datasets. AMIA Annu Symp Proc. 2005;2005:1047.
    3. Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 2008;26:541–547. doi: 10.1038/nbt1360.
    4. McMahon C, Denaxas S. A novel framework for assessing metadata quality in epidemiological and public health research settings. AMIA Jt Summits Transl Sci Proc. 2016;2016:199–208.
    5. Martin MA, VanInsberghe D, Koelle K. Insights from SARS-CoV-2 sequences. Science. 2021;371:466–467. doi: 10.1126/science.abf3995.