Genome Res. 2020 Feb;30(2):185-194.
doi: 10.1101/gr.246934.118. Epub 2020 Jan 24.

Ancestry-agnostic estimation of DNA sample contamination from sequence reads

Fan Zhang et al. Genome Res. 2020 Feb.

Abstract

Detecting and estimating DNA sample contamination are important steps to ensure high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information for accurate estimation of contamination rates. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical or even impossible for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. On the other hand, incorrectly specified allele frequencies may result in substantial bias in estimated contamination rates. For example, we observed that existing methods often fail to identify samples with 10% contamination at a typical 3% contamination exclusion threshold when genetic ancestry is misspecified. Such incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors, even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to the genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used for estimating genetic ancestries, similar to LASER or TRACE, while simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, while our method corrects for such biases.
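
To make the idea of a likelihood that depends on allele frequencies more concrete, the following is a minimal Python sketch, not the verifyBamID2 implementation. It assumes a simplified read-level mixture model: genotypes follow Hardy-Weinberg proportions, each read comes from the contaminating sample with probability `alpha`, and bases are flipped at a fixed error rate. The function names, the linear form used for the individual-specific allele frequency (mean genotype plus SNP loadings times PC coordinates, halved and clipped), the grid search, and all parameter values are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def individual_af(mean_genotype, loadings, pcs, eps=1e-3):
    """Hypothetical individual-specific allele frequency at one site.

    Assumes the expected genotype (0..2) is the site mean plus the dot
    product of SNP loadings and the sample's PC coordinates.
    """
    expected_genotype = mean_genotype + sum(l * x for l, x in zip(loadings, pcs))
    # Clip so the frequency stays strictly inside (0, 1).
    return min(1.0 - eps, max(eps, expected_genotype / 2.0))


def hwe_priors(p):
    """Hardy-Weinberg probabilities of carrying 0, 1, or 2 alternate alleles."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def site_log_likelihood(reads, p_intended, p_contam, alpha, err=0.01):
    """Log-likelihood of the reads at one site (1 = alternate allele observed)."""
    n_alt = sum(reads)
    n_ref = len(reads) - n_alt
    total = 0.0
    for g1, w1 in enumerate(hwe_priors(p_intended)):
        for g2, w2 in enumerate(hwe_priors(p_contam)):
            # True alternate-allele fraction in the mixed DNA.
            q = (1 - alpha) * g1 / 2 + alpha * g2 / 2
            # Probability of observing an alternate base after read errors.
            p_alt = q * (1 - err) + (1 - q) * err
            total += w1 * w2 * p_alt ** n_alt * (1 - p_alt) ** n_ref
    return math.log(total)


def estimate_contamination(sites, grid_step=0.005):
    """Grid-search maximum-likelihood estimate of alpha in [0, 0.5].

    `sites` is a list of (reads, p_intended, p_contam) tuples.
    """
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(int(round(0.5 / grid_step)) + 1):
        alpha = i * grid_step
        ll = sum(site_log_likelihood(r, p1, p2, alpha) for r, p1, p2 in sites)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


def simulate_site(rng, p_intended, p_contam, alpha, depth=30, err=0.01):
    """Simulate reads at one site under the same mixture model."""
    def draw_genotype(p):
        return (rng.random() < p) + (rng.random() < p)

    g1, g2 = draw_genotype(p_intended), draw_genotype(p_contam)
    reads = []
    for _ in range(depth):
        g = g2 if rng.random() < alpha else g1   # which sample the read came from
        alt = rng.random() < g / 2               # which allele was sampled
        if rng.random() < err:                   # sequencing error flips the base
            alt = not alt
        reads.append(int(alt))
    return reads


if __name__ == "__main__":
    rng = random.Random(1)
    sites = []
    for _ in range(300):
        p = rng.uniform(0.05, 0.95)
        sites.append((simulate_site(rng, p, p, 0.10), p, p))
    print("estimated contamination:", estimate_contamination(sites))
```

Passing the same frequency for `p_intended` and `p_contam` loosely corresponds to assuming that both samples share one ancestry, while passing different frequencies allows the two ancestries to differ. In this sketch the frequencies are supplied by the caller; a joint method as described in the abstract would estimate the ancestry coordinates together with the contamination rate.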

Figures

Figure 1.
Overview of the verifyBamID and verifyBamID2 software tools. (A) verifyBamID takes aligned sequence reads (in BAM format) and known variant sites annotated with population allele frequencies (in VCF format) to estimate DNA contamination rates. When allele frequencies are correctly specified, the estimated DNA contamination rates are expected to be accurate (green boxes). However, when the allele frequencies are misspecified (e.g., due to incorrect self-reported ancestry), the estimates of DNA contamination rates may be biased (red boxes). (B) verifyBamID2 takes aligned sequence reads (in BAM/CRAM format) and the top k components of a singular value decomposition (i.e., PCs and SNP loadings) to estimate genetic ancestries and contamination rates together. Because verifyBamID2 does not rely on self-reported ancestry, even if the ancestry of the sample is misspecified or unknown (red box), the estimated contamination rates will be unbiased (green box). In addition, genetic ancestries are estimated in PC coordinates, adjusting for potential contamination.
Figure 2.
Evaluation of estimated genetic ancestry coordinates, in the absence of contamination, between TRACE, LASER, and verifyBamID2 on samples from the 1000 Genomes low-coverage genome (n = 500, diverse ancestry) sequence data (A,C,E) and from the InPSYght deep genome (n = 500, African-Americans) sequence data (B,D,F). Panels A and B show results from TRACE, C and D from LASER, and E and F from verifyBamID2 (assuming no contamination). Each point represents a sample and each color represents a population ancestry, with the exception that gray points represent PCA coordinates of reference (HGDP) samples.
Figure 3.
Impact of DNA sample contamination on the estimation of genetic ancestry. Each point represents a sample. Each gray point represents a reference (HGDP) sample at its PCA coordinates, as in Figure 2. Each colored point represents an in silico–contaminated sample; samples span various contamination rates and populations. In panels A, C, and E, European (GBR) and East Asian (CHS) samples are contaminated with African (YRI) samples at different contamination rates (i.e., between-ancestry contamination). In panels B, D, and F, European (GBR) and East Asian (CHS) samples are contaminated with another sample from the same population (i.e., within-ancestry contamination). Different colors represent different contamination rates ranging from 1% to 20%. Upper panels (A,B) show verifyBamID2 estimates without modeling contamination; middle panels (C,D), verifyBamID2 estimates under the assumption that intended and contaminating populations are identical (i.e., equal-ancestry model); lower panels (E,F), verifyBamID2 estimates under the assumption that intended and contaminating populations can be different (i.e., unequal-ancestry model).
Figure 4.
Comparison of different models to estimate contamination rates. The horizontal (x) axis shows the intended contamination rate; the vertical (y) axis shows the ratio of estimated to intended contamination rates. Each color represents a different model for estimating contamination rates. EUR_AF, EAS_AF, and AFR_AF represent the original verifyBamID using European, East Asian, and African continental population allele frequencies, respectively, from the 1000 Genomes data. Pooled_AF represents the original verifyBamID using aggregated allele frequencies across all 2504 individuals in the 1000 Genomes Project. Equal_Ancestry represents verifyBamID2 assuming that intended and contaminating samples belong to the same population. Unequal_Ancestry represents verifyBamID2 allowing different genetic ancestries for the intended and contaminating samples (recommended setting). Each panel (A–I) represents a different combination of intended (row) and contaminating (column) populations, in the order of GBR, CHS, and YRI.
Figure 5.
Comparison of contamination estimates from verifyBamID and verifyBamID2 on 500 InPSYght samples. All subjects are African-Americans. Each point represents a sequenced subject, plotted as the pair of contamination rate estimates from the two methods being compared. In the left panel, the x-axis shows estimates from the original verifyBamID with pooled allele frequencies (the default setting of verifyBamID), and the y-axis shows estimates from verifyBamID2 with the unequal-ancestry model. The right panel compares the estimated contamination rates between two models (unequal-ancestry vs. equal-ancestry) of verifyBamID2 on the same data set.
