PLoS One. 2011 Mar 9;6(3):e17288.

doi: 10.1371/journal.pone.0017288.

Fast identification and removal of sequence contamination from genomic and metagenomic datasets

Robert Schmieder¹, Robert Edwards

PMID: 21408061
PMCID: PMC3052304
DOI: 10.1371/journal.pone.0017288

Robert Schmieder et al. PLoS One. 2011.

Affiliation

¹ Department of Computer Science, San Diego State University, San Diego, California, United States of America. rschmied@sciences.sdsu.edu

Abstract

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Such sequence contamination is a serious concern for the quality of the data used in downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes, with as much as 64% contaminating sequences. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Alignment sensitivity of BWA-SW for human sequences.**
Query coverage and alignment identity values ranged from 90% to 100%. The sensitivity shows how many sequences could be aligned back to the reference. The simulated datasets contained 28,612,955 reads for 200 bp, 11,444,886 reads for 500 bp, and 5,722,210 reads for 1,000 bp.

**Figure 2. Repeats causing alignment problems for BWA-SW.**
The query coverage was set to 95%, with identity set to 99%, 97% and 94% for error rates of 0%, 2% and 5%, respectively. The numbers above the bars show the number of unaligned sequences of each category for the given thresholds. The values shown in parentheses represent the percentage of unaligned sequences. The simulated datasets contained 28,612,955 reads for 200 bp, 11,444,886 reads for 500 bp, and 5,722,210 reads for 1,000 bp.

**Figure 3. DeconSeq web interface.**
Screenshots of the DeconSeq web interface at different steps of the data processing. The user can either input a data ID to access already processed data (A) or input a new sequence file and select the database (B). After processing the data, the results are shown including the input information (C), Coverage vs. Identity plots for “remove” databases (D) and “retain” databases (E), classification of input data into “clean”, “contamination”, and “both” (F), and download options (G).

**Figure 4. Coverage vs. Identity plots generated by DeconSeq.**
The plots show the number of matching reads for different query coverage and alignment identity values. The size of each dot in the plots is defined by the number of matching reads with exactly this coverage and identity value. Red dots represent matching reads against the “remove” databases and blue dots against “retain” databases. The column and row sums at the top and right of each plot make it easier to identify the number of sequences that match for a particular threshold value. The plots for matching reads against the “remove” databases do not show matching reads that additionally have a match against the “retain” databases (A). Results for reads matching against both databases are shown in a second plot where dots for a single read are connected by lines. If the match against the “remove” database is more similar, then the line is colored red, otherwise blue. In B, for example, the majority of sequences are more similar to the “retain” databases and in C the majority are more similar to the “remove” databases.
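
The counts behind such a plot can be illustrated with a small tally. The following is a minimal sketch, not DeconSeq's own code: it assumes each matching read is available as a pair of integer percentages (coverage, identity), and the function names `tally_matches` and `reads_at_or_above` are hypothetical.

```python
from collections import Counter


def tally_matches(matches):
    """Count reads per (coverage, identity) pair, plus marginal sums.

    `matches` is an iterable of (coverage_pct, identity_pct) integer pairs,
    one per matching read. This input format is an assumption made for
    illustration only.
    """
    cells = Counter(matches)   # dot size: reads with exactly this pair
    by_coverage = Counter()    # marginal sum per coverage value
    by_identity = Counter()    # marginal sum per identity value
    for (coverage, identity), n in cells.items():
        by_coverage[coverage] += n
        by_identity[identity] += n
    return cells, by_coverage, by_identity


def reads_at_or_above(marginal, threshold):
    """Sum a marginal over all values greater than or equal to `threshold`."""
    return sum(n for value, n in marginal.items() if value >= threshold)


if __name__ == "__main__":
    example = [(100, 99), (100, 99), (95, 94), (90, 97)]
    cells, by_coverage, by_identity = tally_matches(example)
    print(cells[(100, 99)])                    # reads in one cell
    print(reads_at_or_above(by_coverage, 95))  # reads with coverage >= 95%
```

Summing a marginal from a threshold upward gives the number of reads that would match at that coverage or identity cutoff, which is the kind of quick readout the row and column sums are described as supporting.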

**Figure 5. Result of human DNA contamination identified in 202 metagenomes.**
All seven human genome sequences were used as “remove” databases and depending on the metagenome type (viral or microbial), the viral or bacterial genomes were selected as “retain” database. 145 (72%) of the metagenomes contained at least one possible contamination sequence using a threshold of 95% query coverage and 94% alignment identity.
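
The threshold-based classification described in the captions can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Hit` type and both functions are hypothetical, the thresholds are the 95% coverage and 94% identity values quoted above, a "retain" hit is assumed to need the same thresholds as a "remove" hit, and "more similar" is assumed to mean higher identity with coverage as the tie-breaker.

```python
from dataclasses import dataclass
from typing import Optional

MIN_COVERAGE = 95.0  # percent query coverage (threshold quoted for Figure 5)
MIN_IDENTITY = 94.0  # percent alignment identity (threshold quoted for Figure 5)


@dataclass(frozen=True)
class Hit:
    """Best alignment of one read against one database (hypothetical type)."""
    coverage: float
    identity: float

    def passes(self, min_coverage=MIN_COVERAGE, min_identity=MIN_IDENTITY):
        return self.coverage >= min_coverage and self.identity >= min_identity


def classify(remove_hit: Optional[Hit], retain_hit: Optional[Hit]) -> str:
    """Return "clean", "contamination", or "both" for a single read."""
    remove_ok = remove_hit is not None and remove_hit.passes()
    # Assumption: a "retain" hit must pass the same thresholds to count.
    retain_ok = retain_hit is not None and retain_hit.passes()
    if not remove_ok:
        return "clean"
    return "both" if retain_ok else "contamination"


def closer_database(remove_hit: Hit, retain_hit: Hit) -> str:
    """For a read matching both databases, name the more similar one.

    Assumption: similarity is compared by identity first, then coverage.
    Ties go to "retain", mirroring the "red, otherwise blue" rule in the
    Figure 4 caption.
    """
    remove_key = (remove_hit.identity, remove_hit.coverage)
    retain_key = (retain_hit.identity, retain_hit.coverage)
    return "remove" if remove_key > retain_key else "retain"


if __name__ == "__main__":
    print(classify(Hit(98, 97), None))
    print(classify(Hit(98, 97), Hit(99, 99)))
    print(closer_database(Hit(98, 97), Hit(99, 99)))
```

Keeping the "both" category separate from "contamination" reflects the abstract's point that hits with higher similarity to non-contaminant genomes should not be removed automatically.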

**Figure 6. Flowchart of DeconSeq for the identification of possible contaminant sequences.**
