Am J Hum Genet. 2023 Dec 7;110(12):2068-2076.
doi: 10.1016/j.ajhg.2023.10.011. Epub 2023 Nov 23.

CHARR efficiently estimates contamination from DNA sequencing data

Wenhan Lu et al. Am J Hum Genet.

Abstract

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
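The core idea of the abstract, that reference reads should be nearly absent at homozygous alternate calls and that their presence hints at contamination, can be illustrated with a minimal Python sketch. This is a simplified illustration, not the published CHARR estimator: the abstract does not give the formula, and the actual metric may apply normalization and filters that are not shown here. The `HomAltCall` container, the function name, and the `min_depth` threshold are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HomAltCall:
    """One homozygous-alternate call for one sample (hypothetical container)."""
    ref_depth: int  # AD[0]: reads supporting the reference allele
    depth: int      # DP: total reads at the site


def mean_ref_read_fraction(
    calls: Iterable[HomAltCall], min_depth: int = 10
) -> Optional[float]:
    """Average AD[0]/DP over homozygous-alternate calls.

    min_depth is an assumed filter for this sketch, not a value from the paper.
    Returns None when no call passes the filter.
    """
    fractions = [c.ref_depth / c.depth for c in calls if c.depth >= min_depth]
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    # The variant described in the Figure 5 caption: 3 reference reads out of 51.
    example = [HomAltCall(ref_depth=3, depth=51)]
    print(mean_ref_read_fraction(example))
```

Because the calculation needs only per-genotype allele depths, a sketch like this works from variant-level records and never touches the underlying reads, which is the cost advantage the abstract describes.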

Keywords: DNA sequencing; contamination; data science; genetic research; quality control; variant calling.

Conflict of interest statement

Declaration of interests: M.J.D. is a founder of Maze Therapeutics and Neumora Therapeutics, Inc. (f/k/a RBNC Therapeutics). B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences and Vor Biosciences and a member of the scientific advisory board of Nurture Genomics.

Figures

Graphical abstract
Figure 1
A schematic of the SVCR. Rows indicate variants that exist in at least one sample in the data. The first four columns describe the key information of the variants, and the following ones store the local allele information for each sample. This format scales variant call data linearly by not duplicating allele information for the reference blocks (sparsity; cross-hatched blue) and using locally indexed fields (LA, local alleles; LGT, local genotypes; LAD, local allele depth; LPL, local phred-scaled likelihoods).
Figure 2
A flowchart describing the pipeline for CRAM file decontamination. Pipeline for computing contamination rate on simulated data (left). Pipeline for decontaminating a single CRAM file (right).
Figure 3
Pipeline for mixing two contamination-free samples at an assigned contamination rate and comparing the contamination rates estimated by the CHARR score and the Freemix score.
Figure 4
Strategy of randomly matching samples with different ancestry groups for the two-way simulation.
Figure 5
An example of a homozygous variant in gnomAD v3 with infiltration of reference reads. Each vertical bar represents a variant, which consists of all the reads sequenced at this site from this specific sample. The gray region on top of the panel represents the distribution of coverage across the variants shown on this panel. The variant presented in the table and also highlighted in the middle of the panel (purple box) has 48 T alleles (red blocks) and 3 A alleles (green blocks). For this specific sample, this variant has a genotype quality (GQ) of 60 and a depth (DP) of 51, indicating a high-quality genotype. However, we observe a relatively high proportion of reference reads (AD[0]/DP = 3/51 ≈ 6%), suggesting potential contamination.
Figure 6
A comparison of Freemix score and CHARR score for 59,765 gnomAD v3 release WGS and 102,063 v2 release WES samples. The mean numbers of homozygous variants used for computing CHARR score are 463,668 and 718 for (A) and (B), respectively. The black dashed line represents y = x.
Figure 7
A comparison of Freemix score (dark blue) and CHARR score (red) for 150 simulated n-way mixed samples across five true contamination levels (x axis). CHARR score is computed using the local allele frequencies filtered to variants with 100% callrate among the 30 decontaminated samples. The black dashed line represents y = x. Error bars represent 95% confidence intervals.
