Review
Genome Biol. 2022 Feb 21;23(1):60.
doi: 10.1186/s13059-022-02619-9.

Contamination detection in genomic data: more is not enough

Luc Cornet et al. Genome Biol. 2022.

Abstract

The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.

Keywords: Algorithms; Contamination detection; Corroboration; Databases; Genomics; Review.

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

Fig. 1
Sources of genomic contamination. Three types of issues lead to contamination of genomic sequence data: biological, experimental and computational. The contamination of “pure” cultures can be due to both experimental causes (e.g. accidental introduction of contaminating microorganisms) and biological causes (e.g. the presence of an endosymbiont). Redundant contamination occurs when a genomic segment is present multiple times in a genome (e.g. multiple SSU rRNAs from different organisms). Non-redundant contamination occurs when a genomic region of the main (expected) organism is replaced by the corresponding region of a foreign organism (e.g. the SSU rRNA of the main organism is replaced by the SSU rRNA from a foreign organism). An extra DNA segment that is not part of the main organism but belongs to a contaminant is also considered non-redundant contamination (e.g. eukaryotic DNA in a bacterial genome). A mixed scenario is also possible, as represented in the redundant contamination part of the figure.
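The distinction between redundant and non-redundant contamination in the caption above can be illustrated with a minimal sketch. It is not taken from any of the reviewed programs: the function name `classify_contamination`, the input layout (a dictionary mapping each marker gene to the taxon labels of the copies found) and the example taxa are all hypothetical, chosen only to show the logic.

```python
def classify_contamination(marker_hits, expected):
    """Label each marker gene according to the Fig. 1 definitions.

    marker_hits: dict mapping a marker name to a list of taxon labels,
                 one label per copy found in the assembly (hypothetical layout).
    expected:    taxon label of the main (expected) organism.
    """
    report = {}
    for marker, taxa in marker_hits.items():
        foreign = [t for t in taxa if t != expected]
        if not taxa:
            report[marker] = "absent"          # no copy found at all
        elif not foreign:
            report[marker] = "clean"           # only copies from the expected organism
        elif len(taxa) > 1:
            report[marker] = "redundant"       # several copies, at least one foreign
        else:
            report[marker] = "non-redundant"   # single copy, and it is foreign
    return report


# Illustrative input only; the taxa and markers are made up.
hits = {
    "SSU_rRNA": ["Cyanobacteria", "Proteobacteria"],
    "rpoB": ["Proteobacteria"],
    "recA": ["Cyanobacteria"],
}
report = classify_contamination(hits, expected="Cyanobacteria")
```

Here `SSU_rRNA` is labelled redundant (an expected and a foreign copy coexist), `rpoB` non-redundant (the only copy is foreign) and `recA` clean. Real tools work from sequence similarity or marker phylogeny rather than ready-made labels, so this only mirrors the definitions, not any detection method.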
Fig. 2
Overview of algorithms. The algorithms are clustered based on their operating principles, as described in the section “Overview of algorithms”. Squares at the top of the figure represent specific features of the algorithms. Non-redundant means that the software can detect contaminant genes without equivalent in the surveyed genome. Intra-species means that the algorithm can detect contamination at the species level. Inter-domain means that the algorithm can detect prokaryotic and eukaryotic contamination simultaneously. Database features show whether the algorithm can use the GTDB Taxonomy and/or a moderately contaminated reference database. Expected organism indicates whether the algorithm can detect the main organism by itself and/or if the user can specify it. Additional functionalities lists notable program-specific functions, such as outputting the completeness of a genome, cleaning a genome of its contaminants, filtering reads based on their taxonomy (positive filtering), or enriching Multiple Sequence Alignments (MSAs) in orthologous sequences while controlling the taxonomy.
