PLoS One. 2014 Oct 29;9(10):e110808.
doi: 10.1371/journal.pone.0110808. eCollection 2014.

Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk. PLoS One.

Abstract

Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species. I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from 'blank' samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood. Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.

Conflict of interest statement

Competing Interests: The author has declared that no competing interests exist.

Figures

Figure 1. Reads matching the human genome are more prevalent in libraries prepared from dilute samples.
(a) The fraction of paired-end reads that preferentially map to the contaminant human genome instead of the E. coli K-12 genome, measured against the total number of reads in the library, is plotted against the amount of E. coli K-12 DNA used per tagmentation procedure as described by Parkinson et al. Shading is used to highlight closely overlapping points (n = 4, 3, and 3 for the 1 ng, 100 pg, and 10 pg libraries, respectively). Libraries listed at each concentration were not identically prepared: each used a different restriction enzyme or set of restriction enzymes at an intermediate step in the protocol (Additional File 1, Table S1), but the number and composition of enzymes used did not appreciably change the number of contaminant reads recovered. (b) The same fraction is plotted for a library prepared in the same experiment using a standard Illumina library preparation protocol. Despite a higher concentration of input DNA, an intermediate number of contaminant reads was detected.
Figure 2. Reads that do not map to the reference genome match a diverse array of clades.
For each experiment (“Tumor”, “RNAseq”, “Sperm”, and “Strandseq” [24]), all reads were individually mapped to the appropriate reference genome using permissive parameters before being used to query the NR database using BLAST. BLAST hits were considered “perfect” if they matched with 100% identity over a dataset-specific length threshold (see Methods). A read was assigned to one of the depicted phylogenetic categories if it did not map or have a perfect BLAST hit to the reference genome, had a perfect BLAST hit to a species in that category, and had no BLAST hits to species outside that category. For each category and experiment, the number of reads meeting these criteria is depicted as a fraction of the total number of reads in the experiment.
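The assignment rule in this caption can be read as a small decision procedure. The following Python sketch is one possible reading, not the author's pipeline: the `BlastHit` record, the `assign_category` function, the `"reference"` label, and the use of "greater than or equal to" for the length threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    """Hypothetical record for one BLAST hit of a read."""
    category: str    # phylogenetic category of the hit species
    identity: float  # percent identity of the alignment
    length: int      # alignment length in bases


def assign_category(mapped_to_reference: bool,
                    hits: List[BlastHit],
                    min_length: int,
                    reference_category: str = "reference") -> Optional[str]:
    """Return the category a read is assigned to, or None if it is unassigned.

    Assumption: hits to the reference species carry their own label
    (reference_category), so any such hit counts as "outside" every
    contaminant category.
    """
    if mapped_to_reference:
        return None

    def is_perfect(hit: BlastHit) -> bool:
        # "Perfect" = 100% identity over a dataset-specific length threshold.
        return hit.identity == 100.0 and hit.length >= min_length

    # A perfect hit to the reference genome disqualifies the read.
    if any(is_perfect(h) and h.category == reference_category for h in hits):
        return None

    perfect_categories = {h.category for h in hits if is_perfect(h)}
    all_categories = {h.category for h in hits}

    # Require a perfect hit in exactly one category and no hits of any
    # quality outside that category.
    if len(perfect_categories) == 1 and all_categories == perfect_categories:
        return next(iter(perfect_categories))
    return None
```

Under this reading, a single imperfect hit to a second category is enough to leave a read unassigned, which makes the rule conservative.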
Figure 3. Experiment-specific correlation between distributions of genera recovered from ‘blank’ and positive samples.
The “RNAseq” and “Strandseq” experiments sequenced libraries prepared from blank samples into which no cells had been introduced. Reads from these blank samples and from all other samples were separately pooled, screened against the mouse reference genome, and queried against the BLAST NR database. Reads were screened using the same criteria as described for Figure 2, but adjusted to the genus taxonomic level. The number of reads matching each genus in each dataset was counted, incremented by one, and log transformed. Values for the pooled positive samples (Strandseq and RNAseq in rows A-B and C-D, respectively) are plotted with their Pearson correlation against values for the pooled negative samples (Strandseq and RNAseq in columns A-C and B-D, respectively). Matched positive and negative samples in (B) and (C) exhibit more correlated read counts than do mismatched positive and negative samples in (A) and (D).
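The transformation and comparison described in this caption can be sketched as follows. This is a minimal illustration only: the function names and the dictionary-of-counts input are assumptions, and the natural logarithm is used because the caption does not state a base (the Pearson correlation is unaffected by the choice of base, since changing base only rescales the values).

```python
import math
from typing import Dict, List


def log_counts(counts: Dict[str, int], genera: List[str]) -> List[float]:
    """Increment each genus count by one and log-transform it.

    Genera missing from `counts` are treated as zero reads.
    """
    return [math.log(counts.get(genus, 0) + 1) for genus in genera]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def blank_vs_positive_correlation(blank_counts: Dict[str, int],
                                  positive_counts: Dict[str, int]) -> float:
    """Correlate log(count + 1) per genus between pooled blank and positive reads."""
    # Compare over the union of genera seen in either pool.
    genera = sorted(set(blank_counts) | set(positive_counts))
    return pearson(log_counts(blank_counts, genera),
                   log_counts(positive_counts, genera))
```

Adding one before taking the logarithm keeps genera with zero reads in one pool on the plot rather than dropping them.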
Figure 4. Heterogeneous species appear to contaminate samples from the same tissue and experiment.
The “Tumor” experiment dissociated 100 individual cells from a sample of a single tumor and sequenced libraries from each. Following the analysis pipeline of a study that claimed to find different plant species in different blood plasma samples from a single experiment, I used bowtie to screen each read in each library against the human reference genome before using it to query a database of chloroplast genomes. The number of such hits to each genome is depicted here, each count incremented by one and log-transformed. Only chloroplast genomes with at least 200 hits are shown. Rows and columns were clustered using a neighbor-joining algorithm.
Figure 5. Reads matching the tomato chloroplast genome are less frequent than other contaminant matching reads in samples of cell-free DNA.
Spisak et al. used the frequency of reads matching chloroplast genomes as evidence that genes pass intact from food to the bloodstream, and found S. lycopersicum (tomato) to be the most common contaminant. Evenly sampling from all of the sequencing samples generated by Spisak et al., I used identical criteria to investigate potential matches to three other contaminant species: E. coli, P. acnes, and M. globosa. P. acnes and M. globosa are associated with the human skin flora. The frequency of reads matching each of these contaminant species, per million reads in the pooled samples, is depicted.
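The "per million reads in the pooled samples" normalization used in this caption amounts to the arithmetic below. The function names and the per-sample pair layout are assumptions for illustration, and the sketch does not reproduce the even sampling step.

```python
from typing import Iterable, Tuple


def reads_per_million(matching_reads: int, total_reads: int) -> float:
    """Frequency of matching reads, scaled to one million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return matching_reads * 1_000_000 / total_reads


def pooled_reads_per_million(samples: Iterable[Tuple[int, int]]) -> float:
    """Pool (matching_reads, total_reads) pairs across samples, then normalize.

    Pooling sums the counts first, so larger samples carry more weight than
    they would if per-sample frequencies were averaged.
    """
    samples = list(samples)
    matching = sum(m for m, _ in samples)
    total = sum(t for _, t in samples)
    return reads_per_million(matching, total)
```

Expressing each species on this common scale is what allows the tomato chloroplast matches to be compared directly with matches to E. coli, P. acnes, and M. globosa.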

References

    1. Schmidt T, Hummel S, Herrmann B (1995) Evidence of contamination in PCR laboratory disposables. Naturwissenschaften 82(9): 423–31.
    2. Leonard JA, Shanks O, Hofreiter M, Kreuz E, Hodges L, et al. (2007) Animal DNA in PCR reagents plagues ancient DNA research. Journal of Archaeological Science 34(9): 1361–6.
    3. Peters RP, Mohammadi T, Vandenbroucke Grauls CM, Danner SA, van Agtmael MA, et al. (2004) Detection of bacterial DNA in blood samples from febrile patients: underestimated infection or emerging contamination? FEMS Immunol Med Microbiol 42(2): 249–53.
    4. Ehricht R, Hotzel H, Sachse K, Slickers P (2007) Residual DNA in thermostable DNA polymerases - a cause of irritation in diagnostic PCR and microarray assays. Biologicals 35(2): 145–7.
    5. Evans GE, Murdoch DR, Anderson TP, Potter HC, George PM, et al. (2003) Contamination of Qiagen DNA extraction kits with Legionella DNA. J Clin Microbiol 41(7): 3452–3.
