Am J Hum Genet. 2012 Nov 2;91(5):839-48.
doi: 10.1016/j.ajhg.2012.09.004. Epub 2012 Oct 25.

Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data

Goo Jun et al. Am J Hum Genet. 2012.

Abstract

DNA sample contamination is a serious problem in DNA sequencing studies and may result in systematic genotype misclassification and false positive associations. Although methods exist to detect and filter out cross-species contamination, few methods to detect within-species sample contamination are available. In this paper, we describe methods to identify within-species DNA sample contamination based on (1) a combination of sequencing reads and array-based genotype data, (2) sequence reads alone, and (3) array-based genotype data alone. Analysis of sequencing reads allows contamination detection after sequence data is generated but prior to variant calling; analysis of array-based genotype data allows contamination detection prior to generation of costly sequence data. Through a combination of analysis of in silico and experimentally contaminated samples, we show that our methods can reliably detect and estimate levels of contamination as low as 1%. We evaluate the impact of DNA contamination on genotype accuracy and propose effective strategies to screen for and prevent DNA contamination in sequencing studies.
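To make the joint sequence-plus-array idea concrete, here is a minimal sketch of a two-sample mixture likelihood. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the model, so the Hardy-Weinberg prior on the contaminant genotype, the fixed base error rate `err`, the grid search, and every function name below are hypothetical choices. Each site is a tuple of the array genotype (alternate-allele count), the population alternate-allele frequency, and the alternate and reference read counts.

```python
import math
import random


def alt_read_prob(geno, err):
    """Probability that one read shows the alternate allele, given an
    alternate-allele count of 0, 1, or 2 and a per-base error rate."""
    frac = geno / 2.0
    return frac * (1.0 - err) + (1.0 - frac) * err


def hwe_prior(freq):
    """Hardy-Weinberg genotype probabilities for alternate-allele counts 0, 1, 2."""
    return [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2]


def log_likelihood(alpha, sites, err=0.01):
    """Log-likelihood of contamination fraction alpha (assumed model).

    Reads are treated as a mixture: a fraction (1 - alpha) comes from the
    sample's own array genotype and a fraction alpha from an unknown
    contaminant whose genotype is summed out under the HWE prior.
    """
    total = 0.0
    for geno, freq, n_alt, n_ref in sites:
        p_own = alt_read_prob(geno, err)
        site_lik = 0.0
        for contam_geno, prior in enumerate(hwe_prior(freq)):
            p = (1 - alpha) * p_own + alpha * alt_read_prob(contam_geno, err)
            site_lik += prior * (p ** n_alt) * ((1 - p) ** n_ref)
        total += math.log(site_lik)
    return total


def estimate_alpha(sites, err=0.01, step=0.005, max_alpha=0.5):
    """Return the grid point in [0, max_alpha] with the highest log-likelihood."""
    grid = [i * step for i in range(int(round(max_alpha / step)) + 1)]
    return max(grid, key=lambda a: log_likelihood(a, sites, err))


def simulate_sites(alpha, n_sites=2000, depth=20, err=0.01, seed=1):
    """Simulate in silico contaminated read counts under the same assumed model."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        freq = rng.uniform(0.05, 0.95)
        own = sum(rng.random() < freq for _ in range(2))
        contam = sum(rng.random() < freq for _ in range(2))
        p = (1 - alpha) * alt_read_prob(own, err) + alpha * alt_read_prob(contam, err)
        n_alt = sum(rng.random() < p for _ in range(depth))
        sites.append((own, freq, n_alt, depth - n_alt))
    return sites
```

Calling `estimate_alpha(simulate_sites(0.05))` should return a value close to the simulated 5% level. A grid search keeps the sketch dependency-free; a bounded one-dimensional optimizer would be a natural replacement when finer resolution is needed.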

Figures

Figure 1
SNP Genotype Calling and Estimation of Contamination from 299 European Sequenced Samples across Chromosome 20. (A) Numbers of heterozygous genotypes. (B) Ratio of the number of heterozygous genotypes to the number of nonreference homozygous genotypes (HET/HOM ratio). (C) Level of DNA sample contamination estimated from sequence data only.
Figure 2
Distribution of Array Intensity for Contaminated and Uncontaminated Samples. B allele frequency (BAF) versus population minor allele frequency (MAF) for (A) uncontaminated (α = 0) and (B) contaminated (α = 10%) samples. Normalized intensity plots for (C) uncontaminated (α = 0) and (D) contaminated (α = 10%) samples.
Figure 3
Estimated Contamination Levels for In Silico Contaminated Samples. (A) Joint sequence and array-based method, (B) sequence-only method, and (C) comparison between these two methods.
Figure 4
Estimated Versus Intended Contamination Levels from the Experimentally Contaminated Array Intensity Data. Three methods (the regression-based method, the multisample mixture model method, and the single-sample mixture model method) were compared in two populations (CEU and YRI).
Figure 5
Estimated Contamination Levels between Sequence-Based Methods. Comparison of estimated contamination levels using sequence data with and without array genotype data for the type 2 diabetes sequencing study.
Figure 6
Genotype Discordance between Sequence-Based and Array-Based Genotypes. Discordance is shown as a function of the estimated contamination level α̂ in the type 2 diabetes sequencing study. Contamination level estimates are based on the combined sequence and genotype array data, and results are stratified by genotypes from HumanOmni2.5 array data. (A) Homozygous reference genotypes, (B) heterozygous genotypes, and (C) homozygous nonreference genotypes.
