[Preprint]. 2023 Jun 28:2023.06.28.545801.
doi: 10.1101/2023.06.28.545801.

CHARR efficiently estimates contamination from DNA sequencing data

Wenhan Lu et al. bioRxiv.

Update in

  • CHARR efficiently estimates contamination from DNA sequencing data.
    Lu W, Gauthier LD, Poterba T, Giacopuzzi E, Goodrich JK, Stevens CR, King D, Daly MJ, Neale BM, Karczewski KJ. Am J Hum Genet. 2023 Dec 7;110(12):2068-2076. doi: 10.1016/j.ajhg.2023.10.011. Epub 2023 Nov 23. PMID: 38000370. Free PMC article.

Abstract

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
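
The core idea in the abstract, reference reads infiltrating homozygous alternate calls, can be illustrated with a minimal Python sketch. This is a simplified illustration, not the authors' implementation: the `Call` record, the function names, and the `min_gq` and `min_dp` thresholds are hypothetical, and the sketch omits the allele-frequency information that the Figure 3 caption says is used when computing CHARR. The example call reuses the values from the Figure 1 caption (GQ 60, DP 51, 3 reference reads); the alternate read count of 48 is an assumption chosen so that the depths sum to 51.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Call:
    """Hypothetical per-variant genotype record for one sample."""
    gt: Tuple[int, int]   # diploid genotype as allele indices, (1, 1) = hom-alt
    ad: Tuple[int, ...]   # allele depths, ad[0] is the reference read count
    dp: int               # total read depth at the site
    gq: int               # genotype quality


def ref_read_fraction(call: Call) -> float:
    """Fraction of reads supporting the reference allele (AD[0] / DP)."""
    return call.ad[0] / call.dp


def charr_sketch(calls: Iterable[Call],
                 min_gq: int = 20,
                 min_dp: int = 10) -> Optional[float]:
    """Mean reference-read fraction over confident hom-alt calls.

    The thresholds are illustrative assumptions, not values from the paper.
    Returns None when no call passes the filters.
    """
    fractions = [
        ref_read_fraction(c)
        for c in calls
        if c.gt == (1, 1) and c.gq >= min_gq and c.dp >= min_dp
    ]
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


# The call described in the Figure 1 caption (alt depth of 48 is assumed).
figure1_call = Call(gt=(1, 1), ad=(3, 48), dp=51, gq=60)
print(round(ref_read_fraction(figure1_call), 3))  # 0.059, i.e. about 6%
```

Because the calculation needs only the genotype, allele depths, and depth at homozygous alternate sites, it can run on variant-level files without access to the underlying reads, which is the cost advantage the abstract describes.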

Figures

Figure 1 | An example of a homozygous variant in gnomAD v3 with infiltration of reference reads. For this specific sample, the variant has a genotype quality (GQ) of 60 and a depth (DP) of 51, indicating a high-quality genotype. However, we observe a relatively high proportion of reference reads (AD[0]/DP = 3/51 ≈ 6%), suggesting potential contamination.
Figure 2 | A comparison of Freemix score and CHARR for (A) 59,765 gnomAD v3 release WGS samples and (B) 102,063 v2 release WES samples. The mean numbers of homozygous variants used for computing CHARR are 463,668 and 718 for (A) and (B), respectively. The black dashed line represents y=x.
Figure 3 | A comparison of Freemix score (green) and CHARR (orange) for 150 simulated n-way mixed samples across five true contamination levels (x-axis). CHARR is computed using the local allele frequencies filtered to variants with 100% callrate among the 30 decontaminated samples. The black dashed line represents y=x.
Figure M1 | A schematic of the SVCR. Rows indicate variants that exist in at least one sample in the data. The first four columns describe the key information of the variants, and the following ones store the local allele information for each sample. This format scales variant call data linearly by not duplicating allele information for the reference blocks (sparsity; cross-hatched blue) and using locally indexed fields (LA: local alleles, LGT: local genotypes, LAD: local allele depth, LPL: local phred-scaled likelihoods).
Figure M2 | A flowchart describing the pipeline for CRAM file decontamination. Pipeline for computing contamination rate on simulated data (left). Pipeline for decontaminating a single CRAM file (right).
Figure M3 | Pipeline for mixing two contamination-free samples at an assigned contamination rate and comparing the contamination rates estimated by CHARR and the Freemix score.
Figure M4 | Strategy for randomly matching samples from different ancestry groups for the two-way simulation.
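
The locally indexed fields named in the Figure M1 caption (LA, LGT, LAD) can be illustrated with a short Python sketch. It assumes that LA lists, for one sample, the indices of the site's alleles that the sample actually carries, and that LGT and LAD are indexed relative to LA. The function name and the choice to fill unobserved alleles with a depth of 0 are assumptions made for this illustration and are not taken from the paper or from Hail's API.

```python
from typing import List, Sequence, Tuple


def local_to_global(n_site_alleles: int,
                    la: Sequence[int],
                    lgt: Tuple[int, int],
                    lad: Sequence[int]) -> Tuple[Tuple[int, int], List[int]]:
    """Translate one sample's locally indexed fields to site-level indexing.

    la  : indices into the site's allele list that this sample carries
    lgt : genotype expressed as indices into `la`
    lad : read depths, one per entry of `la`
    """
    # Each local genotype index points at an entry of LA, which in turn
    # points at an allele in the site-level allele list.
    gt = (la[lgt[0]], la[lgt[1]])

    # Alleles the sample does not carry get depth 0 (assumption of this sketch).
    ad = [0] * n_site_alleles
    for local_index, depth in enumerate(lad):
        ad[la[local_index]] = depth
    return gt, ad


# A site with three alleles, e.g. ["A", "C", "T"]; the sample carries only
# the reference (index 0) and the second alternate (index 2).
gt, ad = local_to_global(n_site_alleles=3, la=[0, 2], lgt=(1, 1), lad=[3, 48])
print(gt)  # (2, 2)
print(ad)  # [3, 0, 48]
```

Storing only the alleles each sample carries keeps per-sample records small at sites with many alternate alleles, which is consistent with the linear scaling described in the caption.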
