Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data

Goo Jun¹, Matthew Flickinger, Kurt N Hetrick, Jane M Romm, Kimberly F Doheny, Gonçalo R Abecasis, Michael Boehnke, Hyun Min Kang

PMID: 23103226
PMCID: PMC3487130
DOI: 10.1016/j.ajhg.2012.09.004

Goo Jun et al. Am J Hum Genet. 2012 Nov 2;91(5):839-48. doi: 10.1016/j.ajhg.2012.09.004. Epub 2012 Oct 25.

Affiliation

¹ Department of Biostatistics and Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA.

Abstract

DNA sample contamination is a serious problem in DNA sequencing studies and may result in systematic genotype misclassification and false positive associations. Although methods exist to detect and filter out cross-species contamination, few methods to detect within-species sample contamination are available. In this paper, we describe methods to identify within-species DNA sample contamination based on (1) a combination of sequencing reads and array-based genotype data, (2) sequence reads alone, and (3) array-based genotype data alone. Analysis of sequencing reads allows contamination detection after sequence data is generated but prior to variant calling; analysis of array-based genotype data allows contamination detection prior to generation of costly sequence data. Through a combination of analysis of in silico and experimentally contaminated samples, we show that our methods can reliably detect and estimate levels of contamination as low as 1%. We evaluate the impact of DNA contamination on genotype accuracy and propose effective strategies to screen for and prevent DNA contamination in sequencing studies.
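The joint use of sequence reads and array genotypes described above can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions, not the authors' implementation: each site is summarized by an alternate-read count rather than per-base qualities, the contaminating genotype is assumed to follow Hardy-Weinberg proportions at a known population allele frequency, and the error rate is fixed. The function names (`site_log_likelihood`, `estimate_alpha`, `simulate_sites`) and all parameter values are hypothetical.

```python
import math
import random


def site_log_likelihood(alpha, genotype, alt_freq, n_alt, n_reads, error=0.01):
    """Log-likelihood of one site's alt-read count at contamination level alpha.

    genotype is the sample's array genotype (0, 1, or 2 alt alleles).
    The binomial coefficient is omitted because it does not depend on alpha.
    """
    # Hardy-Weinberg prior on the unknown contaminant genotype (assumption).
    hwe = [(1 - alt_freq) ** 2, 2 * alt_freq * (1 - alt_freq), alt_freq ** 2]
    total = 0.0
    for g2, prior in enumerate(hwe):
        # A read comes from the contaminant with probability alpha.
        p_alt = (1 - alpha) * genotype / 2 + alpha * g2 / 2
        # Symmetric base-calling error flips the observed allele.
        p_obs = p_alt * (1 - error) + (1 - p_alt) * error
        total += prior * p_obs ** n_alt * (1 - p_obs) ** (n_reads - n_alt)
    return math.log(total)


def estimate_alpha(sites, n_grid=100):
    """Grid-search maximum-likelihood estimate of alpha on [0, 0.5]."""
    grid = [0.5 * i / n_grid for i in range(n_grid + 1)]
    return max(grid, key=lambda a: sum(site_log_likelihood(a, *s) for s in sites))


def simulate_sites(alpha, n_sites=2000, depth=30, error=0.01, seed=0):
    """Simulate (genotype, alt_freq, n_alt, n_reads) tuples for a mixed sample."""
    rng = random.Random(seed)

    def draw_genotype(f):
        return (rng.random() < f) + (rng.random() < f)

    sites = []
    for _ in range(n_sites):
        f = rng.uniform(0.05, 0.95)
        g1, g2 = draw_genotype(f), draw_genotype(f)
        p_alt = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        p_obs = p_alt * (1 - error) + (1 - p_alt) * error
        n_alt = sum(rng.random() < p_obs for _ in range(depth))
        sites.append((g1, f, n_alt, depth))
    return sites


if __name__ == "__main__":
    for true_alpha in (0.0, 0.01, 0.05):
        estimate = estimate_alpha(simulate_sites(true_alpha))
        print(f"true alpha = {true_alpha:.2f}, estimated alpha = {estimate:.3f}")
```

A grid search is used here only to keep the sketch dependency-free; a one-dimensional optimizer would serve the same purpose.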

Figures

**Figure 1**
SNP Genotype Calling and Estimation of Contamination from 299 European Sequenced Samples Across Chromosome 20. (A) Numbers of heterozygous genotypes. (B) Ratio of the number of heterozygous genotypes to the number of nonreference homozygous genotypes (HET/HOM ratio). (C) Level of DNA sample contamination estimated from sequence data only.

**Figure 2**
Distribution of Array Intensity for Contaminated and Uncontaminated Samples. BAF versus population MAF for (A) uncontaminated (α = 0) and (B) contaminated (α = 10%) samples. Normalized intensity plots for (C) uncontaminated (α = 0) and (D) contaminated (α = 10%) samples.

**Figure 3**
Estimated Contamination Levels for In Silico Contaminated Samples. (A) Joint sequence and array-based method, (B) sequence-only method, and (C) comparison between these two methods.

**Figure 4**
Estimated Versus Intended Contamination Levels from the Experimentally Contaminated Array Intensity Data. Three methods (regression-based method, multisample mixture model method, and single-sample mixture model method) were compared in two populations (CEU and YRI).

**Figure 5**
Estimated Contamination Levels between Sequence-Based Methods. Comparison of estimated contamination levels using sequence data with and without array genotype data for the type 2 diabetes sequencing study.

**Figure 6**
Genotype Discordance between Sequence-Based and Array-Based Genotypes. Discordance is shown as a function of estimated contamination level $\hat{\alpha}$ in the type 2 diabetes sequencing study; contamination level estimates are based on the combined sequence and genotype array data and are stratified by genotypes from HumanOmni2.5 array data. (A) Homozygous reference genotypes, (B) heterozygous genotypes, and (C) homozygous nonreference genotypes.
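
The relationship between BAF and population allele frequency shown in Figure 2, and the array-only estimates in Figure 4, can be illustrated with an idealized linear model. This is a sketch under stated assumptions and is not necessarily the paper's regression-based method: it assumes that the expected BAF at a marker is (1 − α)·g/2 + α·f, where g is the sample's B-allele count and f is the population B-allele frequency, and that measurement noise is Gaussian. Real array BAF values are normalized and need not be linear in the mixture proportion. The function names and noise level are hypothetical.

```python
import random


def estimate_alpha_from_baf(markers):
    """Least-squares slope through the origin of (BAF - g/2) on (f - g/2).

    markers is an iterable of (genotype, b_freq, baf) tuples. Under the
    assumed model, BAF - g/2 = alpha * (f - g/2) + noise, so the slope
    estimates alpha.
    """
    num = den = 0.0
    for genotype, b_freq, baf in markers:
        x = b_freq - genotype / 2
        y = baf - genotype / 2
        num += x * y
        den += x * x
    return num / den


def simulate_markers(alpha, n_markers=5000, noise_sd=0.03, seed=0):
    """Simulate idealized BAF values for a sample mixed with one contaminant."""
    rng = random.Random(seed)

    def draw_genotype(f):
        return (rng.random() < f) + (rng.random() < f)

    markers = []
    for _ in range(n_markers):
        f = rng.uniform(0.05, 0.95)
        g1, g2 = draw_genotype(f), draw_genotype(f)
        baf = (1 - alpha) * g1 / 2 + alpha * g2 / 2 + rng.gauss(0.0, noise_sd)
        markers.append((g1, f, baf))
    return markers


if __name__ == "__main__":
    for true_alpha in (0.0, 0.10):
        estimate = estimate_alpha_from_baf(simulate_markers(true_alpha))
        print(f"true alpha = {true_alpha:.2f}, estimated alpha = {estimate:.3f}")
```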


