This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2023 Jun 28:2023.06.28.545801.

doi: 10.1101/2023.06.28.545801.

CHARR efficiently estimates contamination from DNA sequencing data

Wenhan Lu^{1,2,3}, Laura D Gauthier^{1,4}, Timothy Poterba^{1,2,3}, Edoardo Giacopuzzi^{5}, Julia K Goodrich^{1,2}, Christine R Stevens^{1,2,3}, Daniel King^{1,2,3}, Mark J Daly^{1,2,3,6}, Benjamin M Neale^{1,2,3,7}, Konrad J Karczewski^{1,2,7}

Affiliations

¹ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
² Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
³ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
⁴ Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
⁵ Human Technopole, Viale Rita Levi-Montalcini 1, 20157 Milano, Italy.
⁶ Institute for Molecular Medicine Finland, Helsinki, Finland.
⁷ Novo Nordisk Foundation Center, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.

PMID: 37425834
PMCID: PMC10327099
DOI: 10.1101/2023.06.28.545801

Wenhan Lu et al. bioRxiv. 2023.

Update in

CHARR efficiently estimates contamination from DNA sequencing data.
Lu W, Gauthier LD, Poterba T, Giacopuzzi E, Goodrich JK, Stevens CR, King D, Daly MJ, Neale BM, Karczewski KJ. Am J Hum Genet. 2023 Dec 7;110(12):2068-2076. doi: 10.1016/j.ajhg.2023.10.011. Epub 2023 Nov 23. PMID: 38000370.

Abstract

DNA sample contamination is a major issue in clinical and research applications of whole-genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose CHARR (Contamination from Homozygous Alternate Reference Reads), a new metric that estimates DNA sample contamination from variant-level whole-genome and exome sequence data by leveraging the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and can therefore be computed from single-sample gVCFs or from callsets in VCF or BCF format, as well as from variant calls stored efficiently in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
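The core idea in the abstract, reference reads showing up inside homozygous alternate calls, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Call` container, the function names, and the `min_dp` and `min_gq` thresholds are hypothetical choices made for illustration, and the published metric may apply further filtering or normalization that this page does not describe. The sketch simply averages the reference read fraction AD[0]/DP over a sample's homozygous alternate calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional, Tuple


@dataclass
class Call:
    """One genotype call for a single sample (hypothetical container)."""
    gt: Tuple[int, int]   # allele indices, e.g. (1, 1) for homozygous alternate
    ad: Tuple[int, ...]   # per-allele read depths; ad[0] is the reference allele
    dp: int               # total read depth at the site
    gq: int               # genotype quality


def ref_read_fraction(call: Call) -> float:
    """Fraction of reads supporting the reference allele, AD[0] / DP."""
    return call.ad[0] / call.dp


def charr_like_score(calls: Iterable[Call],
                     min_dp: int = 10,
                     min_gq: int = 20) -> Optional[float]:
    """Mean reference read fraction over homozygous alternate calls.

    min_dp and min_gq are illustrative thresholds (assumptions), used only
    to skip low-confidence genotypes. Returns None if no call qualifies.
    """
    fractions = [
        ref_read_fraction(c)
        for c in calls
        if c.gt == (1, 1) and c.dp >= min_dp and c.gq >= min_gq
    ]
    return mean(fractions) if fractions else None
```

Applied to the call described in Figure 1 (GQ 60, DP 51, 3 reference reads), `ref_read_fraction(Call(gt=(1, 1), ad=(3, 48), dp=51, gq=60))` evaluates 3/51, which is about 6%. Heterozygous and homozygous reference calls are ignored because reference reads are expected there regardless of contamination.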


Figures

**Figure 1 |**
An example of a homozygous variant in gnomAD v3 with infiltration of reference reads. For this specific sample, the variant has a genotype quality (GQ) of 60 and a depth (DP) of 51, indicating a high-quality genotype. However, we observe a relatively high proportion of reference reads (AD[0]/DP = 3/51 = 6%), suggesting the presence of potential contamination.

**Figure 2 |**
A comparison of Freemix score and CHARR for (A) 59,765 gnomAD v3 release WGS samples and (B) 102,063 gnomAD v2 release WES samples. The mean numbers of homozygous variants used for computing CHARR are 463,668 and 718 for (A) and (B), respectively. The black dashed line represents y=x.

**Figure 3 |**
A comparison of Freemix score (green) and CHARR (orange) for 150 simulated n-way mixed samples across five true contamination levels (x-axis). CHARR is computed using the local allele frequencies filtered to variants with 100% callrate among the 30 decontaminated samples. The black dashed line represents y=x.

**Figure M1 |**
A schematic of the Scalable Variant Call Representation (SVCR). Rows indicate variants that exist in at least one sample in the data. The first four columns describe the key information of the variants, and the following columns store the local allele information for each sample. This format scales variant call data linearly by not duplicating allele information for the reference blocks (sparsity; cross-hatched blue) and by using locally indexed fields (LA: local alleles, LGT: local genotypes, LAD: local allele depth, LPL: local Phred-scaled likelihoods).
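The locally indexed fields named in the Figure M1 caption can be illustrated with a small Python sketch. It assumes that LA holds indices into the site-level allele list (with index 0 as the reference allele) and that LGT and LAD are expressed in positions within LA; this is one reading of the caption, not a statement of the exact on-disk layout. The function names and the `fill` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def local_to_global_depths(local_alleles: Sequence[int],
                           local_depths: Sequence[int],
                           n_global_alleles: int,
                           fill: int = 0) -> List[int]:
    """Expand LAD (depths per local allele) to one depth per site-level allele.

    Alleles that the sample does not carry locally get `fill` (an assumption).
    """
    if len(local_alleles) != len(local_depths):
        raise ValueError("LA and LAD must have the same length")
    global_depths = [fill] * n_global_alleles
    for global_index, depth in zip(local_alleles, local_depths):
        global_depths[global_index] = depth
    return global_depths


def local_to_global_genotype(local_alleles: Sequence[int],
                             local_gt: Sequence[int]) -> Tuple[int, ...]:
    """Translate LGT (positions within LA) into site-level allele indices."""
    return tuple(local_alleles[a] for a in local_gt)


# A site with four alleles overall; this sample only sees the reference
# (index 0) and the third allele (index 2), so it stores two entries, not four.
la = [0, 2]
lgt = (1, 1)      # homozygous for the second *local* allele
lad = [3, 48]     # depths for the two local alleles

global_gt = local_to_global_genotype(la, lgt)
global_ad = local_to_global_depths(la, lad, n_global_alleles=4)
```

Storing only the alleles a sample actually carries keeps each entry's size independent of how many alleles other samples contribute at the same site, which is the linear scaling the caption refers to.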

**Figure M2 |**
A flowchart describing the pipeline for CRAM file decontamination. Pipeline for computing contamination rate on simulated data (left). Pipeline for decontaminating a single CRAM file (right).

**Figure M3 |**
Pipeline for mixing two contamination-free samples at an assigned contamination rate and comparing the contamination estimates produced by CHARR and the Freemix score.

**Figure M4 |**
Strategy for randomly matching samples from different ancestry groups for the two-way simulation.


