bioRxiv [Preprint]. 2024 Dec 5:2024.12.02.625685.
doi: 10.1101/2024.12.02.625685.

A robust benchmark for detecting low-frequency variants in the HG002 Genome In A Bottle NIST reference material

Camille A Daniels et al. bioRxiv. 2024.

Abstract

Somatic mosaicism is an important cause of disease, but mosaic and somatic variants are often challenging to detect because they exist in only a fraction of cells. To address the need for benchmarking subclonal variants in normal cell populations, we developed a benchmark containing mosaic variants in the Genome in a Bottle Consortium (GIAB) HG002 reference material DNA from a large batch of a normal lymphoblastoid cell line. First, we used a somatic variant caller with high coverage (300x) Illumina whole genome sequencing data from the Ashkenazi Jewish trio to detect variants in HG002 not detected in at least 5% of cells from the combined parental data. These candidate mosaic variants were subsequently evaluated using >100x BGI, Element, and PacBio HiFi data. High confidence candidate SNVs with variant allele fractions above 5% were included in the HG002 draft mosaic variant benchmark, with 13/85 occurring in medically relevant gene regions. We also delineated a 2.45 Gbp subset of the previously defined germline autosomal benchmark regions for HG002 in which no additional mosaic variants >2% exist, enabling robust assessment of false positives. The variant allele fraction of some mosaic variants is different between batches of cells, so using data from the homogeneous batch of reference material DNA is critical for benchmarking these variants. External validation of this mosaic benchmark showed it can be used to reliably identify both false negatives and false positives for a variety of technologies and detection algorithms, demonstrating its utility for optimization and validation. By adding our characterization of mosaic variants in this widely-used cell line, we support extensive benchmarking efforts using it in simulation, spike-in, and mixture studies.

Keywords: SNV; Somatic mosaicism; benchmarking; genome in a bottle; genome sequencing; mosaic variant; somatic variant; variant calling.
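The abstract's emphasis on high coverage (300x) for variants near 5% allele fraction can be illustrated with a simple binomial sampling model. The sketch below is not the authors' method (they used Strelka2 and in silico mixtures to estimate a limit of detection). It assumes each read independently carries the alternate allele with probability equal to the VAF and ignores sequencing error and mapping bias. The function name and the threshold of 5 supporting reads are hypothetical choices for illustration.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, vaf: float, min_alt: int) -> float:
    """Return P(X >= min_alt) for X ~ Binomial(depth, vaf).

    Assumption: reads sample the alternate allele independently with
    probability `vaf`; sequencing error and mapping bias are ignored.
    """
    p_below = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(min_alt)
    )
    return 1.0 - p_below


# Hypothetical requirement: at least 5 alt-supporting reads for a call.
MIN_ALT_READS = 5

for depth in (30, 100, 300):
    p = prob_at_least_k_alt_reads(depth, vaf=0.05, min_alt=MIN_ALT_READS)
    print(f"depth={depth:>3}x  P(>= {MIN_ALT_READS} alt reads at 5% VAF) = {p:.4f}")
```

Under this simplified model, the chance of sampling enough alt reads from a 5% VAF variant rises steeply with depth, which is consistent with the use of 300x data for candidate discovery.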

Conflict of interest statement

Competing Interests: C.E.M. is a co-founder of Onegevity. Y.W., M.R., A.V., L.M., W.C., S.C., J.H., R.M., and G.P. are Illumina employees and equity owners. A.C., P.C., K.S., D.C., A.K., and L.B. are employees of Google LLC and receive equity compensation. P.C.B. sits on the scientific advisory boards of Intersect Diagnostics Inc. and BioSymetrics Inc., and previously sat on that of Sage Bionetworks.

Figures

Figure 1.
Trio-based methodology using high-coverage Illumina data, the Strelka2 somatic caller, and orthogonal next-generation sequencing datasets for candidate mosaic variant detection and validation in HG002. (A) AJ trio (NIST RM - HG002, HG003, and HG004) sequencing and reference mapping (GRCh38) were initially performed by Zook et al. 2016. (B) In silico sample mixtures were created using HG002 and HG003, treating HG003 as normal and the mixtures as tumor, to determine the limit of detection (LOD) for variant allele fraction. Strelka2 somatic calling and benchmarking with hap.py were conducted on the GIAB mixtures to estimate the LOD. (C) To identify potential mosaic and de novo variants, a tumor-normal Strelka2 somatic run was performed with HG002 (son) as tumor and HG003+HG004 (combined parents) as normal. (D) The Strelka2 callset was benchmarked against the GIAB v4.2.1 small variant benchmark with vcfeval to create a candidate variant set, and three orthogonal high-coverage short- and long-read sequencing technologies were used for validation.
Figure 2.
Manually curated potential mosaic variants (135) depicted as vertical lines and arranged by increasing Strelka2 variant allele frequency (X-axis, left to right). Colored dots represent HG002 Illumina 300x (teal) and orthogonal tech datasets for each variant (BGI 100x - red, Element 136x - green, and PacBio HiFi 108x - purple) with corresponding bam-readcount VAFs located on the X-axis. Shaded area indicates the range of VAF (5% to 30%) of variants targeted for inclusion in the benchmark. The top facet illustrates 85 high-confidence SNVs included in the HG002 mosaic benchmark v1.0, while the bottom facet shows 50 SNVs excluded from the benchmark.
Figure 3.
SNV variant allele fractions (VAF) (a) for HG002-GRCh38 mosaic benchmark v1.0 and manually curated variants excluded from the benchmark. Values represent VAFs combined across all orthogonal technologies (BGI, Element, Illumina, PacBio Revio, and Sequel). Dashed vertical lines represent the targeted VAF range (5% - 30%) for the HG002 mosaic benchmark. Manually curated variant counts based on GIAB GRCh38 genome stratifications (b) reveal most mosaic benchmark v1.0 variants occur in easy-to-map and non-homopolymer regions of the genome.
Figure 4.
Mosaic variants change VAF between batches of DNA. HG002 mosaic benchmark variant allele fractions (VAFs) for NIST reference material (RM) 8391 and different batches of non-RM DNA (Coriell, NA24385) for two orthogonal technologies (Element and PacBio Revio). In direct comparisons between materials, higher VAFs were observed in the non-RM DNA than in the GIAB reference material. Coverage: Element RM: 136x, non-RM: 100x; PacBio Revio RM: 48x, non-RM: 120x.
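The Figure 3 caption refers to VAFs combined across technologies and to a targeted range of 5% to 30%. The caption does not say how the values were combined, so the sketch below assumes the simplest approach: pooling alternate and total read counts across datasets. It also attaches an Agresti-Coull binomial interval as one common way to express uncertainty in a proportion; this is an illustrative choice, not a statement of the authors' procedure. All read counts and function names are hypothetical.

```python
from math import sqrt


def pooled_vaf(counts: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Pool (alt_reads, total_reads) pairs across technologies.

    Assumption: "combined" VAF means summed alt reads over summed depth.
    """
    alt = sum(a for a, _ in counts.values())
    depth = sum(d for _, d in counts.values())
    if depth == 0:
        raise ValueError("total depth is zero")
    return alt, depth, alt / depth


def agresti_coull(alt: int, depth: int, z: float = 1.96) -> tuple[float, float]:
    """Agresti-Coull interval for a binomial proportion (z=1.96 ~ 95%)."""
    n_adj = depth + z * z
    p_adj = (alt + z * z / 2) / n_adj
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical read counts for one candidate SNV: (alt reads, total reads).
example_counts = {
    "Illumina": (30, 300),
    "BGI": (9, 100),
    "Element": (14, 136),
    "PacBio HiFi": (11, 108),
}

alt, depth, vaf = pooled_vaf(example_counts)
low, high = agresti_coull(alt, depth)
in_target_range = 0.05 <= vaf <= 0.30  # targeted VAF range from the caption

print(f"pooled VAF = {vaf:.3f} ({alt}/{depth}), interval = [{low:.3f}, {high:.3f}]")
print(f"within 5%-30% target range: {in_target_range}")
```

Pooling weights each dataset by its depth, so a deep dataset dominates the estimate; a per-technology comparison, as in Figure 4, is needed to see disagreement between datasets or DNA batches.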
