Front Microbiol. 2021 Oct 22;12:755101.
doi: 10.3389/fmicb.2021.755101. eCollection 2021.

Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Valérian Lupo et al. Front Microbiol. 2021.

Abstract

The presence of contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information (NCBI) Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-fold algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.

Keywords: NCBI RefSeq; assembly; contamination; databases; genomes; phylogenomics; sequencing.
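
The idea of cross-checking two orthogonal estimates, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' pipeline and calls neither CheckM nor Physeter: the `Estimate` record, the `classify` helper, the genome labels, and all numeric values are hypothetical, and the 5% cutoff is borrowed from the Figure 1 caption as an assumption.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 5.0  # percent; assumed cutoff, taken from the Figure 1 caption


@dataclass
class Estimate:
    """Hypothetical record holding two independent contamination estimates (%)."""
    accession: str
    marker_based: Optional[float]  # marker-gene estimate; None if not usable
    genome_wide: Optional[float]   # genome-wide estimate; None if not usable


def classify(e: Estimate, threshold: float = THRESHOLD) -> str:
    """Combine both estimates into a single label."""
    available = [v for v in (e.marker_based, e.genome_wide) if v is not None]
    if not available:
        return "unassessed"
    flagged = sum(v >= threshold for v in available)
    if flagged == 0:
        return "below threshold"
    if flagged == 2:
        return "flagged by both methods"
    return "flagged by one method"


if __name__ == "__main__":
    # Made-up values for illustration only.
    for est in (
        Estimate("genome_A", 1.2, 0.4),
        Estimate("genome_B", 0.8, 9.5),
        Estimate("genome_C", 12.0, 30.1),
        Estimate("genome_D", None, 6.0),
    ):
        print(est.accession, classify(est))
```

Genomes that land in "flagged by one method" are the cases a single-method study would either miss or over-report, which is the risk of systematic error the abstract points to.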

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Taxonomic tree of the bacterial domain showing the fraction of contaminated genomes in each phylum with each method. Taxon identifiers of the 111,088 RefSeq bacterial genomes were passed to NCBI Common Tree tools to construct the tree [parameters: (1) include unranked taxa, (2) expand all]. Tree visualization was performed with iTOL and branches were collapsed at the taxonomic levels reported in the tree. Triangles are proportional to taxonomic depth. Proteobacteria are colored in orange, FCB group in green, Terrabacteria in red, PVC group in blue, and the other phyla in dark gray. Green barplots are for genomes evaluated with CheckM and blue barplots are for Physeter. The fraction of genomes with a contamination level <5% is shown in a light color, whereas the fraction with a level ≥5% is shown in a dark color. The number of genomes evaluated with each method is indicated by the height of the barplot on a ceiled logarithmic scale. For simplicity, the estimates for Ca. Saccharibacteria (2 contaminated and 12 uncontaminated genomes), candidate division NC10 (2 contaminated genomes), Ca. Atribacteria (2 contaminated genomes), and Ca. Bipolaricaulota (1 contaminated genome) are included in unclassified Bacteria. Completely contaminated phyla (e.g., Caldiserica, Nitrospinae, and Kiritimatiellaeota) are generally represented by very few genomes (i.e., one to three genomes). Among the more extensively studied phyla (11 to 37,487 genomes), some appear to be extremely contaminated, such as Balneolaeota, Synergistetes, and Chloroflexi, with 54.5, 33.3, and 16.9% of genomes contaminated, respectively, whereas other phyla are characterized by a very low contamination level, including Cyanobacteria (2.8%), Gammaproteobacteria (0.6%), and Chlamydiae (0.3%).
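
The per-phylum quantities shown in the barplots (the fraction of genomes at or above 5% contamination, and a bar height on a ceiled logarithmic scale) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' plotting code: the `summarize` function and the input layout are hypothetical, and "ceiled logarithmic scale" is interpreted here as the ceiling of the base-10 logarithm of the genome count, with a minimum height of 1.

```python
import math
from collections import defaultdict


def summarize(records, threshold=5.0):
    """records: iterable of (phylum, contamination_percent) pairs.

    Returns, per phylum, the genome count, the fraction of genomes at or
    above the threshold, and an assumed bar height ceil(log10(n)) >= 1.
    """
    counts = defaultdict(lambda: [0, 0])  # phylum -> [flagged, total]
    for phylum, level in records:
        counts[phylum][1] += 1
        if level >= threshold:
            counts[phylum][0] += 1

    summary = {}
    for phylum, (flagged, total) in counts.items():
        summary[phylum] = {
            "n": total,
            "fraction_contaminated": flagged / total,
            # Assumption: bar height is the ceiled log10 of the count.
            "bar_height": max(1, math.ceil(math.log10(total))),
        }
    return summary


if __name__ == "__main__":
    # Made-up contamination levels for two hypothetical phyla.
    demo = [("phylum_X", v) for v in (0.0, 7.5, 12.0)]
    demo += [("phylum_Y", v) for v in [0.1] * 5 + [20.0] * 6]
    for name, stats in summarize(demo).items():
        print(name, stats)
```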
FIGURE 2
Overview of Physeter properties. (A) Distribution of contamination levels assessed by Physeter in k-fold mode. Genomes are ranked from the lowest to highest median level of contamination. Median levels are shown in a solid orange line, while minimal and maximal levels are represented as yellow and brown dots, respectively. GCF_003612345.1 and GCF_003611835.1 are examples of genomes having a low median level of contamination with some independent estimations showing a higher contamination level. The opposite case is illustrated with GCF_000241265.1. (B) Taxonomic distribution of contaminating sequences within each phylum. The relative contributions of each contaminating phylum were first averaged by genome over all 10 k-folds, then these genome-wise averaged values were averaged by tested phylum over all genomes.
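
The two summaries described in this caption (per-genome minimum, median, and maximum across folds for panel A, and the two-stage averaging for panel B) can be sketched in Python. This is a minimal illustration, not Physeter's implementation: the function names and input layout are hypothetical, and treating a contaminating phylum that is absent from a fold or genome as a contribution of 0 is an assumption.

```python
from collections import defaultdict
from statistics import mean, median


def fold_summary(levels):
    """Panel A style summary of one genome's contamination levels across folds."""
    return {"min": min(levels), "median": median(levels), "max": max(levels)}


def average_contributions(genomes):
    """Panel B style two-stage averaging.

    genomes: list of (tested_phylum, folds), where folds is a list of dicts
    mapping a contaminating phylum to its relative contribution in that fold.
    """
    per_phylum = defaultdict(list)
    # Stage 1: average each genome's contributions over its folds.
    for tested_phylum, folds in genomes:
        sources = {s for fold in folds for s in fold}
        genome_avg = {s: mean(f.get(s, 0.0) for f in folds) for s in sources}
        per_phylum[tested_phylum].append(genome_avg)

    # Stage 2: average the genome-wise values over all genomes of a phylum.
    result = {}
    for tested_phylum, genome_avgs in per_phylum.items():
        sources = {s for g in genome_avgs for s in g}
        result[tested_phylum] = {
            s: mean(g.get(s, 0.0) for g in genome_avgs) for s in sources
        }
    return result


if __name__ == "__main__":
    # Made-up example: two genomes of one tested phylum, two folds each.
    demo = [
        ("tested_X", [{"source_A": 1.0}, {"source_B": 1.0}]),
        ("tested_X", [{"source_A": 1.0}, {"source_A": 1.0}]),
    ]
    print(average_contributions(demo))
    print(fold_summary([0.0, 1.0, 20.0]))
```

Averaging by genome first gives every genome equal weight within its phylum, regardless of how much contaminating sequence each one carries.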
