Genome Biol. 2020 May 12;21(1):115.
doi: 10.1186/s13059-020-02023-1.

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Martin Steinegger et al. Genome Biol. 2020.

Abstract

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the GenBank, RefSeq, and NR databases, respectively, spanning the whole range from draft to "complete" model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator.
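As a rough sanity check, the reported runtime can be converted into an average processing rate. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant rate over the full 12 days; neither assumption is stated in the abstract, and the function name is illustrative only.

```python
def throughput_mb_per_s(total_tb: float, days: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 TB = 1e12 bytes)."""
    total_bytes = total_tb * 1e12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 1e6


# Figures from the abstract: 3.3 TB in 12 days on a 32-core computer.
rate = throughput_mb_per_s(3.3, 12)
per_core = rate / 32  # assumes all 32 cores contribute equally
print(f"{rate:.2f} MB/s overall, {per_core:.3f} MB/s per core")
```

Under these assumptions the average rate is about 3.2 MB/s, or roughly 0.1 MB/s per core.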

Keywords: Contamination; GenBank; Genomes; RefSeq; Software.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
How contamination occurs and how Conterminator detects it. a DNA extraction from an organism (red) is imperfect and often introduces contamination by other species (violet). DNA sequencing then generates short reads that are assembled into longer contigs. Contaminating DNA is typically assembled into separate, small contigs, but is sometimes erroneously included in the same contigs as DNA from the source organism. Contigs may also be linked by scaffolding, which can produce scaffolds containing a mixture of different species. Final assemblies are submitted to GenBank, and higher-quality assemblies are entered in RefSeq. b Conterminator detects contamination in protein and nucleotide sequences across kingdoms, e.g., bacterial contaminants in plant genomes. The following describes the nucleotide contamination detection workflow. (1) We take taxonomically labeled input sequences, cut them into non-overlapping segments of length 1000, and extract a subset of k-mers. (2) We group the k-mers by sorting them and compute ungapped alignments between the first and all succeeding sequences per group. (3) We extract each region of the first sequence that has an alignment to other kingdoms that is longer than 100 nucleotides with a sequence identity greater than 90 %. We perform an exhaustive alignment of the input sequence segments against the multi-kingdom regions. (4) We reconstruct contig lengths within scaffolds by searching for the scaffold breakpoints (indicated by N characters in the DNA sequence) to the left and right of the alignment start and end positions. We predict that contamination is present if an alignment links a contig shorter than 20 kb to a contig longer than 20 kb from a different kingdom.
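The thresholds in this caption can be sketched in a few lines of Python. This is a minimal illustration, not Conterminator's implementation: all function and constant names are hypothetical, k-mer extraction and grouping are omitted, breakpoints are assumed to be uppercase N characters, and the alignment is assumed to lie inside a single contig.

```python
from typing import List

SEGMENT_LEN = 1000      # step (1): non-overlapping segment length
MIN_ALN_LEN = 100       # step (3): alignment longer than 100 nucleotides
MIN_IDENTITY = 0.90     # step (3): sequence identity greater than 90 %
CONTIG_CUTOFF = 20_000  # step (4): 20 kb contig length cutoff


def split_segments(seq: str, seg_len: int = SEGMENT_LEN) -> List[str]:
    """Cut a sequence into non-overlapping segments (the last may be shorter)."""
    return [seq[i:i + seg_len] for i in range(0, len(seq), seg_len)]


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def passes_alignment_filter(aln_len: int, identity: float) -> bool:
    """Apply the length and identity thresholds from step (3)."""
    return aln_len > MIN_ALN_LEN and identity > MIN_IDENTITY


def contig_length_around(scaffold: str, aln_start: int, aln_end: int) -> int:
    """Length of the contig containing [aln_start, aln_end), using the
    nearest N on each side as the scaffold breakpoints."""
    left = scaffold.rfind("N", 0, aln_start)  # -1 if no N to the left
    right = scaffold.find("N", aln_end)
    if right == -1:
        right = len(scaffold)
    return right - (left + 1)


def is_predicted_contamination(query_contig_len: int,
                               target_contig_len: int,
                               same_kingdom: bool) -> bool:
    """Step (4): a short contig aligned to a long contig of another kingdom."""
    return (not same_kingdom
            and query_contig_len < CONTIG_CUTOFF
            and target_contig_len > CONTIG_CUTOFF)


# Toy scaffold: a 30 kb contig, a 100-N gap, then a 5 kb contig.
scaffold = "A" * 30_000 + "N" * 100 + "C" * 5_000
short_len = contig_length_around(scaffold, 30_200, 30_400)
print(short_len, is_predicted_contamination(short_len, 30_000, same_kingdom=False))
```

In the toy scaffold, the alignment falls inside the 5 kb contig, which is below the 20 kb cutoff, so a cross-kingdom hit to a 30 kb contig would be flagged under this rule.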
Fig. 2
Contamination results for RefSeq and GenBank. a Distribution of contaminated species in RefSeq across five kingdoms: Bacteria and Archaea (violet), Fungi (yellow), Metazoa (red), Viridiplantae (green), and other Eukaryotes (turquoise). b Sankey plot of the top 13 contaminated species in RefSeq. We show the taxonomic ranks domain, kingdom, phylum, and species. Numbers shown above each taxonomic node indicate the total number of contaminated sequences. The tree uses the same color code for kingdoms as in a. c, d Same as a, b but for GenBank.
Fig. 3
Contamination in the reference genomes of Homo sapiens and Caenorhabditis elegans. a Alignment of Homo sapiens alternative scaffold NT_187580 of chromosome 10 against RefSeq. Chromosome 10 (NC_000011.10) aligns with 100 % sequence identity from position 1 to 169918. The remaining 18,397 residues of NT_187580 align only to Acidithiobacillus thiooxidans at 98 % sequence identity. Only 6 of the 15 alignments to Acidithiobacillus thiooxidans are shown. b The X chromosome of Caenorhabditis elegans (NC_003284.9) aligns on the left and right flanking positions around 5907856 to 5912458. E. coli genomes align from 5907856 to 5912087, a total of 4231 residues. Only 3 of the 8199 alignments to E. coli are shown.
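The span lengths quoted in this caption can be checked with simple arithmetic. The sketch below uses hypothetical helper names and makes no claim about which coordinate convention the authors used; it only shows that the stated 4231 residues equals end minus start, whereas counting both endpoints would give 4232. The scaffold total is the sum implied by the caption and has not been checked against the NT_187580 record.

```python
def span_half_open(start: int, end: int) -> int:
    """Length when the end coordinate is excluded (end - start)."""
    return end - start


def span_inclusive(start: int, end: int) -> int:
    """Length when both endpoints are counted (end - start + 1)."""
    return end - start + 1


# E. coli alignment on the C. elegans X chromosome (panel b).
ecoli_start, ecoli_end = 5_907_856, 5_912_087
print(span_half_open(ecoli_start, ecoli_end))   # 4231, matches the caption
print(span_inclusive(ecoli_start, ecoli_end))   # 4232

# Panel a: human-aligned part plus the remaining residues of NT_187580.
implied_scaffold_len = 169_918 + 18_397
print(implied_scaffold_len)                     # 188315
```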
Fig. 4
Multiple sequence alignment of 31 spurious bacterial proteins encoded on short contaminated contigs. Shown here are 31 out of 185 spurious proteins from bacterial genomes. A majority of the sequences are 100 % identical. The only differing residues are highlighted in white. This highly conserved “protein” is found across different bacterial phyla, suggesting it is likely a contaminant that has been erroneously translated as part of automated annotation procedures. The respective short contigs (< 1 kb) encoding these spurious proteins align with high sequence identity and coverage to the Ovis aries genome.
