BMC Biol. 2017 Mar 29;15(1):25.
doi: 10.1186/s12915-017-0366-6.

Patterns of cross-contamination in a multispecies population genomic project: detection, quantification, impact, and solutions

Marion Ballenghien et al. BMC Biol.

Abstract

Background: Contamination is a well-known but often neglected problem in molecular biology. Here, we investigated the prevalence of cross-contamination among 446 samples from 116 distinct species of animals, which were processed in the same laboratory and subjected to subcontracted transcriptome sequencing.

Results: Using cytochrome oxidase 1 as a barcode, we identified a minimum of 782 events of between-species contamination, with approximately 80% of our samples being affected. An analysis of laboratory metadata revealed a strong effect of the sequencing center: nearly all the detected events of between-species contamination involved species that were sent on the same day to the same company. We introduce new methods to assess the amount of within-species, between-individual contamination, and to correct for this problem when calling genotypes from base read counts.

Conclusions: We report evidence for pervasive within-species contamination in this data set, and show that classical population genomic statistics, such as synonymous diversity, the ratio of non-synonymous to synonymous diversity, inbreeding coefficient FIT, and Tajima's D, are sensitive to this problem to various extents. Control analyses suggest that our published results are probably robust to the problem of contamination. Recommendations on how to prevent or avoid contamination in large-scale population genomics/molecular ecology are provided based on this analysis.

Keywords: Animals; Genotyping; RNAseq; SNP calling; Transcriptome; Within-species.

Figures

Fig. 1
Detection of within-species contamination through homo-quartet analysis. Each multicolored square represents a quartet, that is, read counts for states A (green), C (yellow), G (blue), and T (orange) at a specific position in a specific individual, zeros being omitted. A fictive dataset of four individuals (Ind1 to Ind4) and five positions (Pos1 to Pos5) is shown. At all five positions, the quartet for individual Ind1 is a homo-quartet (thick borders): the major state has more than 40 reads, and the minor state has exactly one read. Positions Pos1 and Pos2 are monoallelic: the major state represents more than 95% of reads across the four individuals. These two positions provide information on the contamination-free error pattern. Positions Pos3, Pos4, and Pos5 are biallelic: besides the major state, another allele segregates in the sample. At Pos3, the Ind1 minor state (G) differs from the other segregating allele (C); this error cannot result from within-species contamination. At Pos4 and Pos5, the Ind1 minor state is identical to the other segregating allele (T), potentially reflecting allele leakage between individuals, as indicated by red arrows. The relative proportions of these types of position indicate the prevalence of within-species contamination.
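The classification described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout, and the rule used to pick the "other segregating allele" (the most frequent non-major state among the remaining individuals) are assumptions made for illustration. Only the thresholds (more than 40 reads for the major state, exactly one read for the minor state, more than 95% for monoallelic positions) come from the caption.

```python
from collections import Counter


def is_homo_quartet(quartet, min_major=41, minor_reads=1):
    """True if exactly two states are observed, the major one with more
    than 40 reads and the minor one with exactly one read."""
    counts = sorted((c for c in quartet.values() if c > 0), reverse=True)
    return len(counts) == 2 and counts[0] >= min_major and counts[1] == minor_reads


def classify_position(quartets, focal, mono_threshold=0.95):
    """Classify one position from the point of view of a focal individual.

    quartets: dict mapping individual name -> dict of state -> read count.
    Returns 'not_homo_quartet', 'monoallelic', 'biallelic_match'
    (minor read equals the other segregating allele, possible leakage),
    'biallelic_mismatch' (error not explained by within-species
    contamination), or 'undetermined'.
    """
    focal_q = {s: c for s, c in quartets[focal].items() if c > 0}
    if not is_homo_quartet(focal_q):
        return "not_homo_quartet"
    major = max(focal_q, key=focal_q.get)
    minor = min(focal_q, key=focal_q.get)

    # Pool reads across all individuals to test for monoallelism.
    totals = Counter()
    for q in quartets.values():
        totals.update(q)
    if totals[major] / sum(totals.values()) > mono_threshold:
        return "monoallelic"

    # Assumption: the other segregating allele is the most frequent
    # non-major state among the non-focal individuals.
    others = Counter()
    for name, q in quartets.items():
        if name != focal:
            others.update(q)
    others.pop(major, None)
    candidates = [(s, c) for s, c in others.most_common() if c > 0]
    if not candidates:
        return "undetermined"
    other_allele = candidates[0][0]
    return "biallelic_match" if minor == other_allele else "biallelic_mismatch"
```

Counting the three outcomes over many homo-quartets would give the proportions that the caption refers to; how those proportions are converted into a contamination estimate is not specified here and is left out of the sketch.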
Fig. 2
Overall pattern of between-species contamination. a Among-sample distribution of the prevalence of reads mapping to a cox1 reference from the expected (gray) or an unexpected (red) species. Prevalence is defined as the number of cox1 reads per million reads. b Relationship between the prevalence of cox1 reads mapping to the expected (x-axis) vs. an unexpected (y-axis) species, again per million reads. Each dot represents a sample. Plain line: ratio of unexpected to expected cox1 reads equal to one. Dotted lines: ratios of unexpected to expected cox1 reads equal to 0.1 and 0.01. Samples from species not represented in our cox1 reference database are not shown.
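As a small illustration of the quantities plotted in this figure, the sketch below computes the prevalence (cox1 reads per million reads) and places a sample relative to the three reference lines of panel b. The function names and the band labels are hypothetical; only the per-million definition and the ratio values 1, 0.1, and 0.01 come from the caption.

```python
import math


def cox1_prevalence(cox1_reads, total_reads):
    """Number of cox1 reads per million reads in a sample."""
    return cox1_reads * 1_000_000 / total_reads


def unexpected_to_expected_ratio(expected_reads, unexpected_reads):
    """Ratio of cox1 reads from unexpected vs. expected species.
    Both counts come from the same sample, so the per-million
    scaling cancels out."""
    if expected_reads == 0:
        return math.inf if unexpected_reads > 0 else 0.0
    return unexpected_reads / expected_reads


def ratio_band(ratio):
    """Locate a sample relative to the reference lines of panel b."""
    if ratio >= 1:
        return "ratio >= 1"
    if ratio >= 0.1:
        return "0.1 <= ratio < 1"
    if ratio >= 0.01:
        return "0.01 <= ratio < 0.1"
    return "ratio < 0.01"
```

For example, a hypothetical sample with 5,000,000 reads, 250 of which map to the expected cox1 reference and 5 to an unexpected one, would have an expected-species prevalence of 50 per million and a ratio of 0.02.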
Fig. 3
Effect of laboratory metadata on the probability of between-species contamination. Four statistics are shown: lab_overlap (top left), same_technician (top right), same_shipment (bottom left), same_flowcell (bottom right). x-axis: average value of each statistic. Vertical red line: actual data set. y-axis: number of randomized data sets (out of 1000). White histograms: expected distribution assuming random probability of contamination. Blue histograms: expected distribution assuming that contamination is dependent on same_shipment. Green histograms: expected distribution assuming that contamination is dependent on lab_overlap and same_technician.
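The general logic of comparing an observed statistic with 1000 randomized data sets can be sketched as follows for the same_shipment statistic. This is an illustrative assumption, not the authors' procedure: the caption does not describe how the randomized data sets were generated, so the null model here simply draws random pairs of distinct samples. The function names, the data layout, and the add-one p-value are likewise hypothetical.

```python
import random


def same_shipment_fraction(events, shipment):
    """Average of the same_shipment indicator over contamination events.
    events: list of (sample_a, sample_b) pairs.
    shipment: dict mapping sample name -> shipment identifier."""
    return sum(shipment[a] == shipment[b] for a, b in events) / len(events)


def randomization_test(events, shipment, n_rand=1000, seed=0):
    """Compare the observed statistic with randomized data sets in which
    each event is replaced by a random pair of distinct samples
    (assumed null model: contamination independent of metadata)."""
    rng = random.Random(seed)
    samples = sorted(shipment)
    observed = same_shipment_fraction(events, shipment)
    null = []
    for _ in range(n_rand):
        fake = [tuple(rng.sample(samples, 2)) for _ in events]
        null.append(same_shipment_fraction(fake, shipment))
    # One-sided p-value with the usual add-one correction.
    p_value = (1 + sum(v >= observed for v in null)) / (n_rand + 1)
    return observed, null, p_value
```

A histogram of the returned null values, with the observed value marked as a vertical line, would correspond in spirit to the white histograms and red line of the figure. The blue and green histograms would require null models conditioned on other metadata, which this sketch does not attempt.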
Fig. 4
Robustness of population genomic estimates to contamination-aware single-nucleotide polymorphism (SNP) calling. a Synonymous diversity πS; b ratio of non-synonymous to synonymous diversity, πN/πS; c FIT; d Tajima’s D, synonymous SNPs only. Each dot represents a species. x-axis: estimates obtained assuming no contamination. y-axis: estimates obtained from contamination-aware SNP calling. Black dots: γ = 0.05; blue dots: γ = 0.1; red dots: γ = 0.2.
