[Preprint]. 2025 Sep 20:2025.09.18.677206.
doi: 10.1101/2025.09.18.677206.

Comprehensive benchmarking of somatic structural variant detection at ultra-low allele fractions

Yuwei Zhang et al. bioRxiv.

Abstract

Postzygotic mosaicism gives rise to somatic structural variants (SVs) at ultra-low variant allele fractions (VAFs), which pose challenges for detection due to the high-coverage sequencing required and noise introduced by sequencing artifacts. Although somatic SV detection has been extensively studied in cancer, these studies are not directly applicable to the study of tissue mosaicism, as they rely on matched normals, target higher VAF ranges, and are enriched for different types of SVs. We present comprehensive benchmark data and best practices for non-cancer somatic SV detection. We created a synthetic mosaic sample by combining six HapMap individuals at varying proportions, generating allele fractions as low as 0.25%. This sample was sequenced to ~2,300x total coverage using Illumina, PacBio, and Nanopore technologies across multiple sequencing centers. A high-confidence benchmark SV set containing over 21,000 pseudo-somatic insertions and deletions ≥50 bp was derived from haplotype-resolved assemblies. We evaluated 12 SV discovery pipelines and identified caller-specific strengths and sequencing platform-specific shortcomings. We find that short read-based approaches show reduced recall for insertions and repeat-associated SVs, whereas long-read sequencing achieves high accuracy throughout the genome, increasing linearly with coverage. The best algorithm's sensitivity exceeded 80% for VAFs ≥4% and 15% for VAFs of 0.5-1% with 60x coverage. The publicly available benchmarking data and comparative analysis of current methods provide a foundation for robust discovery of SV mosaicism in non-cancer tissues.

Conflict of interest statement

FJS receives research support from Illumina, PacBio and Oxford Nanopore. All other authors declare no conflict.

Figures

Figure 1. Experimental design of the SMaHT MIMS benchmark
a) Six HapMap samples with HPRC assemblies were mixed in vitro at different ratios to represent germline and somatic variants. This sample was distributed across five Genome Characterization Centers (GCCs) and sequenced with short-read (Illumina) and long-read technologies (Oxford Nanopore, ONT, and Pacific Biosciences, HiFi). Across sequencing runs, 12 somatic SV calling strategies were benchmarked against a high-confidence assembly-based SV set. b) The SV benchmark (Multiple Individuals in a Mixed Sample, MIMS) was built using diploid assemblies from the HPRC. All SVs were merged and harmonized in a single VCF file, and variants were classified as germline if present in the high-abundance sample (HG005, 83.5%) and as somatic otherwise. Quality control of the mixes, validation, and annotation were performed to select high-confidence variants. c) Coverage statistics of all the WGS experiments by GCC and technology (based on coverage at one million randomly selected genomic bases). d) Read statistics of all the WGS experiments by GCC and technology.
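The germline/somatic rule in panel b, and the way mixing ratios translate into expected allele fractions, can be sketched roughly as follows. Only the 83.5% share of HG005 is taken from the caption; the donor names `donor_B` to `donor_F` and their shares are hypothetical placeholders chosen so that the shares sum to 1 and the smallest heterozygous contribution is 0.25%, as stated in the abstract. The function names are illustrative and not part of the authors' pipeline.

```python
from typing import Dict

# HG005's 83.5% share is from the figure caption. All other names and
# shares are HYPOTHETICAL placeholders used only for illustration.
MIX: Dict[str, float] = {
    "HG005": 0.835,
    "donor_B": 0.10,
    "donor_C": 0.02,
    "donor_D": 0.02,
    "donor_E": 0.02,
    "donor_F": 0.005,
}


def expected_vaf(alt_copies: Dict[str, int], mix: Dict[str, float] = MIX) -> float:
    """Expected allele fraction of a variant in the mixed sample.

    alt_copies maps a donor name to the number of haplotypes (0, 1 or 2)
    carrying the variant in that donor's diploid assembly.
    """
    for donor, copies in alt_copies.items():
        if donor not in mix:
            raise KeyError(f"unknown donor: {donor}")
        if copies not in (0, 1, 2):
            raise ValueError("alt_copies values must be 0, 1 or 2")
    # Each donor contributes its share of the mix times the fraction of its
    # two haplotypes that carry the variant.
    return sum(mix[d] * c / 2.0 for d, c in alt_copies.items())


def classify(alt_copies: Dict[str, int], major: str = "HG005") -> str:
    """Label a variant as in panel b: germline if the major donor carries it."""
    return "germline" if alt_copies.get(major, 0) > 0 else "somatic"


if __name__ == "__main__":
    het_in_rarest = {"donor_F": 1}
    print(classify(het_in_rarest), expected_vaf(het_in_rarest))
    het_in_major = {"HG005": 1}
    print(classify(het_in_major), expected_vaf(het_in_major))
```

Under these placeholder shares, a variant that is heterozygous only in the rarest donor lands at the 0.25% floor, while any variant carried by HG005 is labelled germline regardless of what the other donors carry.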
Figure 2. Sequencing Validation and Benchmark SV Composition
a) Observed minus expected VAF over TR alleles for a 100x HiFi sequencing experiment. Histogram binwidth 0.05. Hue is expected VAF Bin. All sequencing experiments available in Supplementary Figure 2b. b) Recall for TRs with minimum read support of 1x, 3x, and 5x for six sequencing experiments using two long-read technologies from four sequencing centers. Note that Broad and WashU HiFi experiments overlap. c) Observed Cumulative Distribution Function (CDF) of recall on 100x HiFi sequencing. All sequencing experiments plotted in Supplementary Figure 2c. d) Theoretical CDF of the beta-binomial process modeled on 100x coverage. e) SV Count by Size, Type, and Variant Allele Fraction (VAF). f) SV VAF Distribution for deletions and insertions; Histogram binwidth 0.02.
Figure 3. Benchmark of 12 somatic SV calling strategies.
a) Recall by VAF bin for each platform (Illumina, HiFi, and ONT), shown for the replicate with the highest sequencing depth. Callers are ordered by overall recall; darker color indicates a higher recall. VAF bins by expected VAF are <0.5%, 0.5–1%, 1–4%, 4–5%, 5–10%, 10–20%. b) Difficulty-stratified performance. Left: each SV is assigned to one difficulty category (insertions, small SVs (≤250 bp), low VAF, tandem-repeat overlap, or clustered/adjacent SV) or to a residual “not-challenging” class. Middle and right: recall (middle) and precision (right) are reported for 12 workflows. Recall is calculated on difficulty-exclusive subsets, with sample size P (the number of calls in the subset) shown in headers. The bottom row (“Not challenging”) contains variants lacking any annotated difficulty. Dots show results from individual sequencing replicates (marker shape and color represent replicate and platform, respectively, with depth highlighted in the legend), and vertical bars denote the replicate mean for each workflow.
Figure 4. Platform-specific detection limits for somatic SVs.
Recall across VAF bins at fixed sequencing depths. For each platform (Illumina, HiFi, ONT), the two best-performing callers are plotted; each dot marks the recall achieved in a given VAF bin. Grey lines show the genotyper-estimated recall if detection requires 1, 2, or 3 supporting reads for each sample. Grey vertical bars indicate the average recall gain (averaged across two callers) for a 30x increase in coverage relative to the depth immediately to the left, illustrating the marginal benefit of additional sequencing.
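The idea of a recall ceiling set by a minimum number of supporting reads can be illustrated with a simple binomial read-sampling model. This is an idealized sketch, not the genotyper-based estimate plotted in the figure: it assumes every site is covered at exactly `depth` reads, that each read carries the variant independently with probability equal to the VAF, and that there are no mapping or sequencing errors. The function name `prob_min_support` is hypothetical.

```python
from math import comb


def prob_min_support(depth: int, vaf: float, min_reads: int) -> float:
    """P(at least min_reads variant-supporting reads) under Binomial(depth, vaf)."""
    if depth < 0 or min_reads < 0:
        raise ValueError("depth and min_reads must be non-negative")
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must lie in [0, 1]")
    # Probability of seeing fewer than min_reads supporting reads.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min(min_reads, depth + 1))
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Tabulate the idealized ceiling for a few depths and VAFs.
    for depth in (30, 60, 90):
        for vaf in (0.005, 0.01, 0.04):
            row = [prob_min_support(depth, vaf, k) for k in (1, 2, 3)]
            print(depth, vaf, [round(p, 3) for p in row])
```

In this model, raising the required read support lowers the attainable recall at a given depth, and the effect is strongest in the lowest VAF bins, where many true variants are sampled by too few reads, or by none, to meet the threshold.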
Figure 5. Validation of orthogonal methods with the SMaHT MIMS benchmark
a) SV recall from a WGS experiment on the Element Biosciences Aviti short-read platform (232x coverage), used as orthogonal validation. Recall of DELs is shown for three short-read SV callers and a genotyper. For the genotyper, SVs were stratified by VAF. The numbers shown are the total number of DELs to be detected; the proportion detected is shown on the y-axis. b) IGV screenshot of a germline DEL (expected VAF ~94%) that overlaps three exons of the ZNF718 gene. Shown (from top to bottom) are the benchmark SV, followed by coverage, junction information (marked with orange arrows), and read tracks for the four replicate samples: Baylor College of Medicine (BCM), Broad Institute Inc (Broad), New York Genome Center (NYGC) and Washington University (WashU). c) Boxplot of the coverage distribution of the three exons overlapping and not overlapping the benchmark SV shown in panel b. d) IGV screenshot of a germline DEL (expected VAF ~42%) that overlaps two exons of the NEK4 gene. Both affected exons show lowered coverage, close to the expectation based on the sample mix and genotype (observed 55.4% and 58.1% coverage, expected 58%). e) Boxplot of the coverage distribution of the three exons overlapping and not overlapping the benchmark SV shown in panel d. f) IGV screenshot of a somatic DEL (expected VAF ~5%, highlighted with an arrow) that overlaps an exon of the IFI16 gene. The affected exon shows a coverage reduction slightly higher than the expectation based on the sample mix and genotype (observed 25%, expected 5%). g) Boxplot of the coverage distribution of the two exons overlapping and not overlapping the benchmark SV shown in panel f.
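The expected coverage values quoted in panels d and f follow from simple arithmetic, sketched below under stated assumptions: the region is diploid and autosomal, coverage is otherwise uniform, and the VAF is read as the fraction of chromosome copies in the mix that carry the deletion. The function name `expected_relative_coverage` is hypothetical.

```python
def expected_relative_coverage(vaf: float) -> float:
    """Expected read depth over a deleted segment, relative to flanking depth.

    A fraction `vaf` of chromosome copies lack the segment, so only the
    remaining fraction contributes reads.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must lie in [0, 1]")
    return 1.0 - vaf


if __name__ == "__main__":
    # Expected VAFs quoted in the caption for panels b, d and f.
    for vaf in (0.94, 0.42, 0.05):
        remaining = expected_relative_coverage(vaf)
        print(f"VAF {vaf:.2f}: relative coverage {remaining:.2f}, drop {vaf:.2f}")
```

For the NEK4 deletion this gives the 58% expected coverage stated in panel d; for the IFI16 deletion the expected drop equals the 5% VAF, which is the baseline against which the observed 25% is compared.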
