Ancestry-agnostic estimation of DNA sample contamination from sequence reads

Fan Zhang^{1,2}, Matthew Flickinger^{1,3}, Sarah A Gagliano Taliun^{1,3}; InPSYght Psychiatric Genetics Consortium; Gonçalo R Abecasis^{1,3}, Laura J Scott^{1,3}, Steven A McCaroll^{4,5}, Carlos N Pato⁶, Michael Boehnke^{1,3}, Hyun Min Kang^{1,3}

Affiliations

¹ Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109-2029, USA.
² Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109-2218, USA.
³ Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109-2029, USA.
⁴ Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
⁵ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
⁶ SUNY Downstate Medical Center, Brooklyn, New York 11203, USA.

PMID: 31980570
PMCID: PMC7050530
DOI: 10.1101/gr.246934.118

Genome Res. 2020 Feb;30(2):185-194. doi: 10.1101/gr.246934.118. Epub 2020 Jan 24.

Abstract

Detecting and estimating DNA sample contamination are important steps in ensuring high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information to estimate contamination rates accurately. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical, or even impossible, for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. Incorrectly specified allele frequencies, on the other hand, may substantially bias the estimated contamination rates. For example, we observed that when genetic ancestry is misspecified, existing methods often fail to flag samples with 10% contamination at a typical 3% contamination exclusion threshold. Such incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors, even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to the genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used to estimate genetic ancestries, similar to LASER or TRACE, while simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, whereas our method corrects for such biases.
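The idea of a likelihood built on individual-specific allele frequencies can be illustrated with a minimal sketch. It is not the authors' implementation: the linear form in `individual_af`, the Hardy-Weinberg genotype prior, the symmetric base-error rate `err`, and the grid search in `estimate_alpha` are all assumptions made for illustration, and every function name is hypothetical. The sketch holds the allele frequencies fixed and maximizes only over the contamination rate, whereas the method described above estimates ancestry coordinates and contamination together.

```python
import math


def individual_af(mu, loadings, pcs, eps=1e-3):
    """Hypothetical individual-specific allele frequency at one site.

    mu       -- mean reference allele frequency at the site
    loadings -- SNP loadings of the site on each PC (allele-frequency scale)
    pcs      -- the sample's coordinates on the same PCs
    The result is clipped into (0, 1) so that logarithms stay finite.
    """
    af = mu + sum(l * x for l, x in zip(loadings, pcs))
    return min(max(af, eps), 1.0 - eps)


def hwe(p):
    """Genotype probabilities for 0, 1, 2 alternate alleles (assumed HWE)."""
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)


def alt_read_prob(g, err):
    """Probability that one read shows the alternate allele given genotype g."""
    return (g / 2.0) * (1 - err) + (1 - g / 2.0) * err


def log_likelihood(sites, alpha, err=0.01):
    """Log-likelihood of contamination rate alpha.

    sites -- iterable of (n_alt, n_ref, af_intended, af_contam), with both
             allele frequencies strictly between 0 and 1.
    Each read comes from the intended sample with probability 1 - alpha and
    from the contaminating sample with probability alpha; the two unknown
    genotypes are summed out. Binomial coefficients are dropped because
    they do not depend on alpha.
    """
    q = [alt_read_prob(g, err) for g in range(3)]
    total = 0.0
    for n_alt, n_ref, p1, p2 in sites:
        prior1, prior2 = hwe(p1), hwe(p2)
        terms = []
        for g1 in range(3):
            for g2 in range(3):
                mix = (1 - alpha) * q[g1] + alpha * q[g2]
                terms.append(math.log(prior1[g1] * prior2[g2])
                             + n_alt * math.log(mix)
                             + n_ref * math.log(1 - mix))
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def estimate_alpha(sites, grid_step=0.01, max_alpha=0.5):
    """Grid-search maximum-likelihood estimate of the contamination rate."""
    n = int(round(max_alpha / grid_step))
    grid = [i * grid_step for i in range(n + 1)]
    return max(grid, key=lambda a: log_likelihood(sites, a))
```

In this sketch, passing the same frequency for `af_intended` and `af_contam` corresponds to assuming that both samples share one ancestry, while passing two different values allows the intended and contaminating samples to differ.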

Figures

**Figure 1.**
Overview of the *verifyBamID* and *verifyBamID2* software tools. (A) *verifyBamID* takes aligned sequence reads (in BAM format) and known variant sites annotated with population allele frequencies (in VCF format) to estimate DNA contamination rates. When allele frequencies are correctly specified, the estimated DNA contamination rates are expected to be accurate (green boxes). However, when the allele frequencies are misspecified (e.g., due to incorrect self-reported ancestry), the estimates of DNA contamination rates may be biased (red boxes). (B) *verifyBamID2* takes aligned sequence reads (in BAM/CRAM format) and the top k components of a singular value decomposition (i.e., PCs and SNP loadings) to estimate genetic ancestries and contamination rates together. Because *verifyBamID2* does not rely on self-reported ancestry, the estimated contamination rates will be unbiased (green box) even if the ancestry of the sample is misspecified or unknown (red box). In addition, genetic ancestries are estimated in PC coordinates, adjusting for potential contamination.

**Figure 2.**
Evaluation of estimated genetic ancestry coordinates, in the absence of contamination, between *TRACE*, *LASER*, and *verifyBamID2* on samples from the 1000 Genomes low-coverage genome (n = 500, diverse ancestry) sequence data (A,C,E) and from the InPSYght deep genome (n = 500, African-Americans) sequence data (B,D,F). Panels A and B show results from *TRACE*, C and D from *LASER*, and E and F from *verifyBamID2* (assuming no contamination). Each point represents a sample and each color represents a population ancestry, with the exception that gray points represent PCA coordinates of reference (HGDP) samples.

**Figure 3.**
Impact of DNA sample contamination on the estimation of genetic ancestry. Each point represents a sample. Each gray point represents a reference (HGDP) sample at its PCA coordinates, as in Figure 2. Each colored point represents an in silico–contaminated sample; samples span various contamination rates and populations. In panels A, C, and E, European (GBR) and East Asian (CHS) samples are contaminated with African (YRI) samples at different contamination rates (i.e., between-ancestry contamination). In panels B, D, and F, European (GBR) and East Asian (CHS) samples are contaminated with another sample from the same population (i.e., within-ancestry contamination). Different colors represent different contamination rates, ranging from 1% to 20%. *Upper* panels (A,B) show *verifyBamID2* estimates without modeling contamination; *middle* panels (C,D), *verifyBamID2* estimates under the assumption that intended and contaminating populations are identical (i.e., equal-ancestry model); *lower* panels (E,F), *verifyBamID2* estimates under the assumption that intended and contaminating populations can be different (i.e., unequal-ancestry model).
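In silico contamination of the kind described in this caption can be sketched as mixing reads from two samples at a chosen rate. The excerpt does not give the exact procedure used in the paper, so the function below is a hypothetical illustration: `contaminate`, its arguments, and the choice to keep the total read count equal to that of the intended sample are all assumptions.

```python
import random


def contaminate(intended_reads, contaminant_reads, alpha, seed=0):
    """Mix two read sets so that a fraction alpha comes from the contaminant.

    The output has as many reads as intended_reads: round(alpha * total)
    reads are drawn without replacement from contaminant_reads and the
    remainder from intended_reads, then the mixture is shuffled.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)
    total = len(intended_reads)
    n_contam = round(alpha * total)
    if n_contam > len(contaminant_reads):
        raise ValueError("not enough contaminating reads for this alpha")
    mixed = (rng.sample(intended_reads, total - n_contam)
             + rng.sample(contaminant_reads, n_contam))
    rng.shuffle(mixed)
    return mixed
```

For example, calling `contaminate(gbr_reads, yri_reads, 0.05)` on two hypothetical read lists would mimic a between-ancestry scenario at a 5% contamination rate.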

**Figure 4.**
Comparison of different models for estimating contamination rates. The horizontal (x) axis shows the intended contamination rate; the vertical (y) axis shows the ratio of estimated to intended contamination rates. Each color represents a different model. EUR_AF, EAS_AF, and AFR_AF represent the original *verifyBamID* using European, East Asian, and African allele frequencies, respectively, computed across the corresponding continental population in the 1000 Genomes data. Pooled_AF represents the original *verifyBamID* using allele frequencies aggregated across all 2504 individuals in the 1000 Genomes Project. Equal_Ancestry represents *verifyBamID2* assuming that the intended and contaminating samples belong to the same population. Unequal_Ancestry represents *verifyBamID2* allowing different genetic ancestries for the intended and contaminating samples (recommended setting). Each panel (A–I) represents a different combination of intended (row) and contaminating (column) populations, in the order GBR, CHS, and YRI.

**Figure 5.**
Comparison of contamination estimates from *verifyBamID* and *verifyBamID2* on 500 InPSYght samples. All subjects are African-Americans. Each point represents a sequenced subject and shows the pair of contamination rate estimates obtained with the two methods being compared. In the *left* panel, the x-axis shows the estimated contamination rates from the original *verifyBamID* with pooled allele frequencies (the default setting of *verifyBamID*), and the y-axis shows the estimates from *verifyBamID2* with the unequal-ancestry model. The *right* panel compares the estimated contamination rates between two models (unequal-ancestry vs. equal-ancestry) of *verifyBamID2* on the same data set.
