PLoS One. 2014 Oct 29;9(10):e110808.

doi: 10.1371/journal.pone.0110808. eCollection 2014.

Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk¹

PMID: 25354084
PMCID: PMC4213012
DOI: 10.1371/journal.pone.0110808

Richard W Lusk. PLoS One. 2014.

Affiliation

¹ Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America.

Abstract

Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species. I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from 'blank' samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood. Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.

Conflict of interest statement

Competing Interests: The author has declared that no competing interests exist.

Figures

**Figure 1. Reads matching the human genome are more prevalent in libraries prepared from dilute samples.**
(a) The fraction of paired-end reads that preferentially map to the contaminant human genome instead of the *E. coli* K-12 genome, measured against the total number of reads in the library, is plotted against the amount of *E. coli* K-12 DNA used per tagmentation procedure as described by Parkinson et al. Shading is used to highlight closely overlapping points (n = 4, 3, and 3 for the 1 ng, 100 pg, and 10 pg libraries, respectively). Libraries listed at each concentration were not identically prepared, each using a different restriction enzyme or set of restriction enzymes at an intermediate step in the protocol (Additional File 1, Table S1), but the number and composition of enzymes used did not appreciably change the number of contaminant reads recovered. (b) The same fraction is plotted for a library prepared in the same experiment using a standard Illumina library preparation protocol. Despite a higher concentration of input DNA, an intermediate number of contaminant reads was detected.

**Figure 2. Reads that do not map to the reference genome match a diverse array of clades.**
For each experiment (“Tumor”, “RNAseq”, “Sperm”, and “Strandseq” [24]), all reads were individually mapped to the appropriate reference genome using permissive parameters before being used to query the NR database using BLAST. BLAST hits were considered “perfect” if they matched with 100% identity over a dataset-specific length threshold (see Methods). A read was assigned to one of the depicted phylogenetic categories if it did not map or have a perfect BLAST hit to the reference genome, had a perfect BLAST hit to a species in that category, and had no BLAST hits to species outside that category. For each category and experiment, the fraction of reads meeting these criteria against the total number of reads in the experiment is depicted.
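The assignment rule in this caption can be sketched as a small function. This is an illustrative sketch under stated assumptions, not the author's pipeline: the `Hit` record, its fields, and the names `is_perfect` and `assign_category` are hypothetical, and the length threshold is a parameter because the caption describes it as dataset-specific.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One BLAST hit for a read (hypothetical record, not from the paper)."""
    category: str       # phylogenetic category of the hit species
    identity: float     # percent identity of the alignment
    length: int         # alignment length in bases
    is_reference: bool  # True if the hit species is the reference genome's species


def is_perfect(hit: Hit, min_length: int) -> bool:
    # Assumption: "over a length threshold" is read as length >= min_length.
    return hit.identity == 100.0 and hit.length >= min_length


def assign_category(mapped: bool, hits: List[Hit], min_length: int) -> Optional[str]:
    """Return the single category a read is assigned to, or None."""
    # Reads that map to the reference genome are excluded.
    if mapped:
        return None
    # Reads with a perfect BLAST hit to the reference are excluded.
    if any(h.is_reference and is_perfect(h, min_length) for h in hits):
        return None
    # Every hit must fall in one category (no hits outside it).
    categories = {h.category for h in hits}
    if len(categories) != 1:
        return None
    # At least one hit in that category must be perfect.
    if not any(is_perfect(h, min_length) for h in hits):
        return None
    return next(iter(categories))
```

Under this reading, a read with a perfect bacterial hit and an imperfect fungal hit is left unassigned, because the fungal hit falls outside the bacterial category.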

**Figure 3. Experiment-specific correlation between distributions of genera recovered from ‘blank’ and positive samples.**
The “RNAseq” and “Strandseq” experiments sequenced libraries prepared from blank samples into which no cells had been introduced. Reads from these blank samples and from all other samples were separately pooled, screened against the mouse reference genome, and queried against the BLAST NR database. Reads were screened using the same criteria as described for Figure 2, but adjusted to the genus taxonomic level. The number of reads matching each genus in each dataset was counted, incremented by one, and log transformed. Values for the pooled positive samples (Strandseq and RNAseq in rows A-B and C-D, respectively) are plotted with their Pearson correlation against values for the pooled negative samples (Strandseq and RNAseq in columns A-C and B-D, respectively). Matched positive and negative samples in (B) and (C) exhibit more correlated read counts than do mismatched positive and negative samples in (A) and (D).
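The transform and correlation described in this caption can be sketched as follows. This is a minimal sketch, not the author's code: the function names are hypothetical, the natural logarithm is an assumption (the caption does not state a base, and Pearson correlation is unchanged by the choice of base), and taking the union of genera across both datasets is also an assumption.

```python
import math
from typing import Dict, List, Tuple


def log_counts(pos: Dict[str, int], neg: Dict[str, int]) -> Tuple[List[str], List[float], List[float]]:
    """Increment each per-genus read count by one and log transform it."""
    # Assumption: a genus missing from one dataset has a count of zero there.
    genera = sorted(set(pos) | set(neg))
    x = [math.log(pos.get(g, 0) + 1) for g in genera]
    y = [math.log(neg.get(g, 0) + 1) for g in genera]
    return genera, x, y


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Adding one before the log keeps genera with zero reads in one dataset on the plot at a value of zero instead of leaving them undefined.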

**Figure 4. Heterogeneous species appear to contaminate samples from the same tissue and experiment.**
The “Tumor” experiment dissociated 100 individual cells from a sample of a single tumor and sequenced libraries from each. Following the analysis pipeline of a study that claimed to find different plant species in different blood plasma samples from a single experiment, I used bowtie to screen each read in each library against the human reference genome before using it to query a database of chloroplast genomes. The number of such hits to each genome is depicted here, each count incremented by one and log-transformed. Only chloroplast genomes with at least 200 hits are shown. Rows and columns were clustered using a neighbor-joining algorithm.

**Figure 5. Reads matching the tomato chloroplast genome are less frequent than other contaminant-matching reads in samples of cell-free DNA.**
Spisak et al. used the frequency of reads matching chloroplast genomes as evidence that genes pass intact from food to the bloodstream, and found *S. lycopersicum* (tomato) to be the most common contaminant. Evenly sampling from all of the sequencing samples generated by Spisak et al., I used identical criteria to investigate potential matches to three other contaminant species: *E. coli*, *P. acnes*, and *M. globosa*. *P. acnes* and *M. globosa* are associated with the human skin flora. The frequency of reads matching each of these contaminant species, per million reads in the pooled samples, is depicted.
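The per-million normalization named in this caption is simple arithmetic, sketched below. The function names and the `(matching, total)` tuple layout are hypothetical, and pooling by summing counts across samples is an assumption about how "pooled samples" is computed.

```python
from typing import Iterable, Tuple


def reads_per_million(matching: int, total: int) -> float:
    """Matching reads per million total reads."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matching * 1_000_000 / total


def pooled_reads_per_million(samples: Iterable[Tuple[int, int]]) -> float:
    """Pool (matching, total) counts across samples, then normalize.

    Assumption: pooling means summing counts over samples from which an
    equal number of reads was drawn.
    """
    samples = list(samples)
    matching = sum(m for m, _ in samples)
    total = sum(t for _, t in samples)
    return reads_per_million(matching, total)
```

Normalizing after pooling puts each species' frequency on the same scale regardless of how many reads the pool contains.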
