Am J Hum Genet. 2023 Dec 7;110(12):2068-2076.

doi: 10.1016/j.ajhg.2023.10.011. Epub 2023 Nov 23.

CHARR efficiently estimates contamination from DNA sequencing data

Affiliations

¹ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
² Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
³ Human Technopole, Viale Rita Levi-Montalcini 1, 20157 Milano, Italy.
⁴ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA.
⁵ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Molecular Medicine Finland, Helsinki, Finland.
⁶ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
⁷ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. Electronic address: konradk@broadinstitute.org.

PMID: 38000370
PMCID: PMC10716339
DOI: 10.1016/j.ajhg.2023.10.011

Wenhan Lu et al. Am J Hum Genet. 2023.

Abstract

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose CHARR (contamination from homozygous alternate reference reads), a metric that estimates DNA sample contamination from variant-level whole-genome and -exome sequence data by leveraging the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as from efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and -exome sequencing datasets.

Keywords: DNA sequencing; contamination; data science; genetic research; quality control; variant calling.
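
The core idea in the abstract, reference reads appearing inside homozygous alternate calls, can be illustrated with a minimal sketch. This is not the published CHARR estimator, whose exact formula and filters are not given in this record; it only averages the raw reference read fraction AD[0]/DP over high-quality homozygous alternate calls. The `Call` structure, the function name, the thresholds `min_gq` and `min_dp`, and the use of the sum of allelic depths as DP are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Call:
    """One genotype call for a single sample (hypothetical structure)."""
    gt: Tuple[int, int]    # allele indices, e.g. (1, 1) for homozygous alternate
    ad: Tuple[int, ...]    # allelic depths; ad[0] is the reference read count
    gq: int                # genotype quality


def ref_read_fraction(calls: Iterable[Call],
                      min_gq: int = 20,
                      min_dp: int = 10) -> Optional[float]:
    """Mean fraction of reference reads at homozygous alternate calls.

    Assumption: depth is taken as the sum of allelic depths, and the
    thresholds are illustrative defaults, not values from the paper.
    """
    fractions = []
    for call in calls:
        depth = sum(call.ad)
        is_hom_alt = call.gt[0] == call.gt[1] and call.gt[0] != 0
        if is_hom_alt and call.gq >= min_gq and depth >= min_dp:
            fractions.append(call.ad[0] / depth)
    # No usable homozygous alternate calls: no estimate can be made.
    return sum(fractions) / len(fractions) if fractions else None
```

In a clean sample, a homozygous alternate call should carry almost no reference reads, so a mean fraction well above the sequencing error rate hints at contamination. Heterozygous and homozygous reference calls are skipped because reference reads are expected there.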

Conflict of interest statement

Declaration of interests: M.J.D. is a founder of Maze Therapeutics and Neumora Therapeutics, Inc. (f/k/a RBNC Therapeutics). B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences and Vor Biosciences and a member of the scientific advisory board of Nurture Genomics.

Figures

**Figure 1**
A schematic of the SVCR. Rows indicate variants that exist in at least one sample in the data. The first four columns describe the key information of the variants, and the following ones store the local allele information for each sample. This format scales variant call data linearly by not duplicating allele information for the reference blocks (sparsity; cross-hatched blue) and by using locally indexed fields (LA, local alleles; LGT, local genotypes; LAD, local allele depth; LPL, local phred-scaled likelihoods).

**Figure 2**
A flowchart describing the pipeline for CRAM file decontamination. Pipeline for computing contamination rate on simulated data (left). Pipeline for decontaminating a single CRAM file (right).

**Figure 3**
Pipeline for mixing two contamination-free samples at an assigned contamination rate and comparing the contamination rates estimated by the CHARR score and the Freemix score.

**Figure 4**
Strategy of randomly matching samples with different ancestry groups for the two-way simulation

**Figure 5**
An example of a homozygous variant in gnomAD v3 with infiltration of reference reads. Each vertical bar represents a variant, which consists of all the reads sequenced at this site from this specific sample. The gray region on top of the panel represents the distribution of coverage across the variants shown on this panel. The variant presented in the table and also highlighted in the middle of the panel (purple box) has 48 T alleles (red blocks) and 3 A alleles (green blocks). For this specific sample, this variant has a genotype quality (GQ) of 60 and a depth (DP) of 51, indicating a high-quality genotype. However, we observe a relatively high proportion of reference reads (AD[0]/DP = 3/51 ≈ 6%), suggesting the presence of potential contamination.

**Figure 6**
A comparison of Freemix score and CHARR score for 59,765 gnomAD v3 release WGS and 102,063 v2 release WES samples. The mean number of homozygous variants used for computing the CHARR score is 463,668 for (A) and 718 for (B). The black dashed line represents y = x.

**Figure 7**
A comparison of Freemix score (dark blue) and CHARR score (red) for 150 simulated n-way mixed samples across five true contamination levels (x axis). The CHARR score is computed using the local allele frequencies filtered to variants with 100% call rate among the 30 decontaminated samples. The black dashed line represents y = x. Error bars indicate 95% confidence intervals.

Update of

CHARR efficiently estimates contamination from DNA sequencing data.
Lu W, Gauthier LD, Poterba T, Giacopuzzi E, Goodrich JK, Stevens CR, King D, Daly MJ, Neale BM, Karczewski KJ. bioRxiv [Preprint]. 2023 Jun 28:2023.06.28.545801. doi: 10.1101/2023.06.28.545801. Update in: Am J Hum Genet. 2023 Dec 7;110(12):2068-2076. doi: 10.1016/j.ajhg.2023.10.011. PMID: 37425834.
