This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Dec 5:2024.12.02.625685.

doi: 10.1101/2024.12.02.625685.

A robust benchmark for detecting low-frequency variants in the HG002 Genome In A Bottle NIST reference material

Camille A Daniels¹, Adetola Abdulkadir¹, Megan H Cleveland², Jennifer H McDaniel², David Jáspez³, Luis Alberto Rubio-Rodríguez³, Adrián Muñoz-Barrera³, José Miguel Lorenzo-Salazar³, Carlos Flores³,⁴,⁵,⁶, Byunggil Yoo⁷, Sayed Mohammad Ebrahim Sahraeian⁸, Yina Wang⁹, Massimiliano Rossi⁹, Arun Visvanath⁹, Lisa Murray⁹, Wei-Ting Chen⁹, Severine Catreux⁹, James Han⁹, Rami Mehio⁹, Gavin Parnaby⁹, Andrew Carroll¹⁰, Pi-Chuan Chang¹⁰, Kishwar Shafin¹⁰, Daniel Cook¹⁰, Alexey Kolesnikov¹⁰, Lucas Brambrink¹⁰, Mohammed Faizal Eeman Mootor¹¹, Yash Patel¹¹, Takafumi N Yamaguchi¹¹, Paul C Boutros¹¹, Karolina Sienkiewicz¹², Jonathan Foox¹², Christopher E Mason¹², Bryan R Lajoie¹³, Carlos A Ruiz-Perez¹³, Semyon Kruglyak¹³, Justin M Zook², Nathan D Olson²

Affiliations

¹ Medical Device Innovation Consortium (MDIC), 1655 N Ft. Myer Drive, 12th Floor, Arlington, VA, USA 22209.
² Material Measurement Laboratory, National Institute of Standards and Technology (NIST), 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA.
³ Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Polígono Industrial de Granadilla s/n, 38600 Santa Cruz de Tenerife, Spain.
⁴ Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Carretera del Rosario 145, 38010, Santa Cruz de Tenerife, Spain.
⁵ CIBER de Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, Monforte de Lemos 3-5, Pabellón 11, 28029 Madrid, Spain.
⁶ Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, Calle de la Juventud s/n, 35450, Las Palmas de Gran Canaria, Spain.
⁷ Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA.
⁸ Roche Sequencing Solutions, Santa Clara, CA, 95050, USA.
⁹ Illumina Inc., San Diego, CA, USA.
¹⁰ Google Inc, Mountain View, CA, USA.
¹¹ Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA.
¹² Department of Medicine and Weill Cornell Cancer Center, Weill Cornell Medicine, New York, NY, USA.
¹³ Element Biosciences, San Diego, CA, USA.

PMID: 39677813
PMCID: PMC11642750
DOI: 10.1101/2024.12.02.625685

Camille A Daniels et al. bioRxiv. 2024.

Abstract

Somatic mosaicism is an important cause of disease, but mosaic and somatic variants are often challenging to detect because they exist in only a fraction of cells. To address the need for benchmarking subclonal variants in normal cell populations, we developed a benchmark containing mosaic variants in the Genome in a Bottle Consortium (GIAB) HG002 reference material DNA from a large batch of a normal lymphoblastoid cell line. First, we used a somatic variant caller with high-coverage (300x) Illumina whole genome sequencing data from the Ashkenazi Jewish trio to detect variants in HG002 not detected in at least 5% of cells from the combined parental data. These candidate mosaic variants were subsequently evaluated using >100x BGI, Element, and PacBio HiFi data. High-confidence candidate SNVs with variant allele fractions above 5% were included in the HG002 draft mosaic variant benchmark, with 13/85 occurring in medically relevant gene regions. We also delineated a 2.45 Gbp subset of the previously defined germline autosomal benchmark regions for HG002 in which no additional mosaic variants with variant allele fractions above 2% exist, enabling robust assessment of false positives. The variant allele fraction of some mosaic variants differs between batches of cells, so using data from the homogeneous batch of reference material DNA is critical for benchmarking these variants. External validation of this mosaic benchmark showed that it can be used to reliably identify both false negatives and false positives for a variety of technologies and detection algorithms, demonstrating its utility for optimization and validation. By adding our characterization of mosaic variants in this widely used cell line, we support extensive benchmarking efforts using it in simulation, spike-in, and mixture studies.
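To illustrate what a variant allele fraction (VAF) threshold such as 5% means for a single candidate site, the following minimal sketch computes a VAF from read counts together with a Wilson score interval. This is not the authors' pipeline: the function name `vaf_wilson_interval`, the choice of the Wilson interval, and the example read counts are assumptions made purely for illustration.

```python
import math


def vaf_wilson_interval(alt_reads, depth, z=1.96):
    """Return (vaf, lower, upper) for a site.

    vaf is alt_reads / depth; lower and upper are the bounds of the
    Wilson score interval at the normal quantile z (1.96 is roughly 95%).
    Illustrative helper only, not taken from the preprint.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    p = alt_reads / depth
    denom = 1 + z ** 2 / depth
    center = (p + z ** 2 / (2 * depth)) / denom
    half_width = z * math.sqrt(p * (1 - p) / depth + z ** 2 / (4 * depth ** 2)) / denom
    return p, max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical site: 30 supporting reads at 300x coverage.
vaf, lower, upper = vaf_wilson_interval(30, 300)
print(f"VAF={vaf:.3f}, interval=({lower:.3f}, {upper:.3f})")
exceeds_five_percent = lower > 0.05  # conservative check against a 5% threshold
```

Comparing the lower bound, rather than the point estimate, against the threshold is one conservative design choice; at lower coverage the interval widens and the same observed VAF may no longer clear it.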

Keywords: SNV; Somatic mosaicism; benchmarking; genome in a bottle; genome sequencing; mosaic variant; somatic variant; variant calling.

Conflict of interest statement

Competing interests: C.E.M. is a co-founder of Onegevity. Y.W., M.R., A.V., L.M., W.C., S.C., J.H., R.M., and G.P. are Illumina employees and equity owners. A.C., P.C., K.S., D.C., A.K., and L.B. are employees of Google LLC and receive equity compensation. P.C.B. sits on the scientific advisory boards of Intersect Diagnostics Inc. and BioSymetrics Inc., and previously sat on that of Sage Bionetworks.

Figures

**Figure 1.**
Trio-based methodology using high coverage Illumina data, Strelka2 somatic caller, and orthogonal next generation sequencing datasets for candidate mosaic variant detection and validation in HG002. (A) AJ trio (NIST RM - HG002, HG003, and HG004) sequencing and reference mapping (GRCh38) were initially performed by Zook et al 2016. (B) *In silico* sample mixtures were created using HG002 and HG003, treating HG003 as normal and the mixtures as tumor, to determine the limit of detection for variant allele fraction. Strelka2 somatic calling and benchmarking with hap.py was conducted using the GIAB mixtures to estimate a limit of detection (LOD). (C) To identify potential mosaic and de novo variants, a tumor-normal Strelka2 somatic run, with HG002 (son) as tumor and HG003+HG004 (combined parents) as normal, was performed. (D) The Strelka2 callset was benchmarked against the GIAB v4.2.1 small variant benchmark with vcfeval to create a candidate variant set, and three orthogonal high-coverage short- and long-read sequencing technologies were used for validation.
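As a worked illustration of the *in silico* mixture idea in panel (B), the sketch below computes the VAF one would expect for a variant when HG002 reads are mixed into HG003 reads at a given fraction. It assumes that reads are sampled in proportion to the mixing fraction and that the variant is heterozygous in HG002 and absent from HG003; the function name `expected_mixture_vaf` and the example fractions are hypothetical and are not taken from the preprint.

```python
def expected_mixture_vaf(hg002_fraction, vaf_in_hg002=0.5, vaf_in_hg003=0.0):
    """Expected VAF of a variant in a two-sample read mixture.

    hg002_fraction is the share of reads drawn from HG002 (0 to 1).
    The defaults model a heterozygous HG002 variant that is absent
    from HG003. Illustrative assumption, not the authors' method.
    """
    if not 0.0 <= hg002_fraction <= 1.0:
        raise ValueError("hg002_fraction must be between 0 and 1")
    return hg002_fraction * vaf_in_hg002 + (1.0 - hg002_fraction) * vaf_in_hg003


# Hypothetical mixing fractions and the VAFs they would be expected to produce.
for fraction in (0.02, 0.10, 0.20, 0.50):
    print(f"HG002 fraction {fraction:.2f} -> expected VAF {expected_mixture_vaf(fraction):.3f}")
```

Under these assumptions, the expected VAF of a heterozygous HG002-only variant is half the mixing fraction, which is why a series of mixtures can be used to probe a caller's limit of detection.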

**Figure 2.**
Manually curated potential mosaic variants (135) depicted as vertical lines and arranged by increasing Strelka2 variant allele frequency (X-axis, left to right). Colored dots represent HG002 Illumina 300x (teal) and orthogonal tech datasets for each variant (BGI 100x - red, Element 136x - green, and PacBio HiFi 108x - purple) with corresponding bam-readcount VAFs located on the X-axis. Shaded area indicates the range of VAF (5% to 30%) of variants targeted for inclusion in the benchmark. The top facet illustrates 85 high-confidence SNVs **included** in the HG002 mosaic benchmark v1.0, while the bottom facet shows 50 SNVs **excluded** from the benchmark.

**Figure 3.**
SNV variant allele fractions (VAF) (a) for HG002-GRCh38 mosaic benchmark v1.0 and manually curated variants excluded from the benchmark. Values represent VAFs combined across all orthogonal technologies (BGI, Element, Illumina, PacBio Revio, and Sequel). Dashed vertical lines represent the targeted VAF range (5% - 30%) for the HG002 mosaic benchmark. Manually curated variant counts based on GIAB GRCh38 genome stratifications (b) reveal most mosaic benchmark v1.0 variants occur in easy-to-map and non-homopolymer regions of the genome.
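The caption does not state how VAFs were combined across technologies. One simple possibility, shown here only as a sketch, is to pool alternate-allele and total read counts before dividing, then test the pooled value against the targeted 5% to 30% range. The function names `pooled_vaf` and `in_target_range` and all read counts are hypothetical.

```python
def pooled_vaf(counts_by_technology):
    """Pool (alt_reads, depth) pairs across technologies into one VAF.

    counts_by_technology maps a technology name to (alt_reads, depth).
    Pooling read counts is an assumed combination rule for illustration.
    """
    total_alt = sum(alt for alt, _ in counts_by_technology.values())
    total_depth = sum(depth for _, depth in counts_by_technology.values())
    if total_depth == 0:
        raise ValueError("total depth must be positive")
    return total_alt / total_depth


def in_target_range(vaf, low=0.05, high=0.30):
    """Return True if vaf lies within the targeted range (inclusive)."""
    return low <= vaf <= high


# Hypothetical read counts for one candidate SNV.
counts = {
    "Illumina": (30, 300),
    "BGI": (9, 100),
    "Element": (15, 136),
    "PacBio HiFi": (10, 108),
}
combined = pooled_vaf(counts)
print(f"pooled VAF = {combined:.3f}, in target range: {in_target_range(combined)}")
```

Pooling weights each technology by its coverage, so a deeply sequenced dataset dominates the combined estimate; averaging per-technology VAFs would weight them equally instead.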

**Figure 4.**
Mosaic variants change VAF between batches of DNA. HG002 mosaic benchmark variant allele fractions (VAFs) for NIST reference material (RM) 8391 and different batches of non-RM DNA (Coriell, NA24385) for two orthogonal technologies (Element and PacBio Revio). Higher VAFs were observed in direct VAF comparisons between materials compared to GIAB reference material. Coverage: Element RM: 136x, non-RM: 100x; PacBio Revio RM: 48x, non-RM: 120x.

Grants and funding

P30 CA016042/CA/NCI NIH HHS/United States
