Abstract

Structural variants (SVs) contribute substantially to genomic variation and disease, but detecting somatic SVs (sSVs) remains difficult due to reference bias, mosaicism, and enrichment in repetitive regions. Linear reference genomes, like GRCh38 and CHM13, do not fully capture individual genomic structure, which can obscure true somatic variation. Donor-specific assemblies (DSAs) generated from the same genome where sSVs are being assayed provide a personalized alternative, yet their performance for sSV detection has not been systematically assessed. As part of the Somatic Mosaicism across Human Tissues (SMaHT) Network, we benchmark a DSA for sSV discovery in the COLO829 melanoma cell line with a matched normal sample from the same individual. We compare sSV detection across GRCh38, CHM13, and the COLO829BL_DSA using three different sSV callers (Delly, Severus, and Sniffles2) and sequence data from multiple long-read platforms. The COLO829BL_DSA identifies 1.8-fold more manually validated sSVs than linear references, in regions both shared with GRCh38 and CHM13 and unique to the COLO829BL_DSA. Variants detected only with the COLO829BL_DSA are often found in satellite and other repeat-rich regions that are difficult to resolve using standard references. In addition, several COLO829BL_DSA-specific sSVs are located in genes, some of which are cancer associated. Overall, these results underscore the utility of DSAs in improving sSV detection.

Conflict of interest statement

A.S. is a co-inventor on a patent related to the Fiber-seq and DAF-seq methods. J.B. is a consultant for Mosaica Medicines. E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. All other authors declare no competing interests.

Figures

Figure 1. Somatic structural variant (sSV) discovery.
A) Schematic depicting study design. PacBio high-fidelity (HiFi), standard Oxford Nanopore Technologies (STD-ONT), and ultra-long ONT (UL-ONT) sequencing data for COLO829BL and COLO829 were aligned to three reference genomes: GRCh38, CHM13, and the COLO829BL_DSA. MAPQ values were manually adjusted for the COLO829BL_DSA calls to avoid variant dropout. Three sSV callers, Delly [15], Severus [16], and Sniffles2 [17], were used to detect somatic structural variation. B) Raw sSV call counts for each caller, reference genome, and type of sequencing data. Color represents the type of sSV. Note that sSV counts vary by more than an order of magnitude depending on the caller. C) Variant allele frequency (VAF) distributions for GRCh38 and CHM13, and haploid variant allele frequency (hVAF) for COLO829BL_DSA, shown for raw sSVs stratified by reference genome and caller. DSA: donor-specific assembly.
Figure 2. sSV callset integration.
A) Schematic illustrating the process of integrating raw sSV calls, which involves removing redundancy by sequencing data type and validating each variant through detection by multiple callers or sequencing data types. B) Plot summarizing the number of unique variants per caller and reference genome after integrating all sequencing data types (HiFi, STD-ONT, UL-ONT). Bars are colored by the sequencing data types that detected each variant: HiFi, ONT (STD-ONT or UL-ONT), or both. C) Pie charts summarizing the number of variants in each reference genome detected by at least two of three sSV callers or by all three sequencing data types. Slice color indicates structural variant type: pink - BND (breakend), blue - DEL (deletion), green - DUP (duplication), purple - INS (insertion), and orange - INV (inversion). D) Upset plot indicating which callers detected each variant in the consensus callset and the total number of variants contributed by each sSV caller. Bar color corresponds to the reference genome.
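The consensus rule in panel C (a variant is kept if at least two of three callers report it, or if all three sequencing data types support it) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `consensus_calls`, the `(variant_id, caller, platform)` record layout, and the assumption that redundant calls have already been merged under a shared `variant_id` are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Sequencing data types named in the caption; the rule below is an assumed reading of panel C.
PLATFORMS = {"HiFi", "STD-ONT", "UL-ONT"}


def consensus_calls(raw_calls, min_callers=2):
    """Return the IDs of variants passing the consensus rule.

    raw_calls: iterable of (variant_id, caller, platform) tuples, where
    variant_id is a hypothetical key shared by calls already merged as
    the same variant (the merging step itself is not shown here).
    """
    callers_seen = defaultdict(set)
    platforms_seen = defaultdict(set)
    for variant_id, caller, platform in raw_calls:
        callers_seen[variant_id].add(caller)
        platforms_seen[variant_id].add(platform)

    kept = set()
    for variant_id, callers in callers_seen.items():
        multi_caller = len(callers) >= min_callers
        # Superset test: every sequencing data type supports the variant.
        all_platforms = platforms_seen[variant_id] >= PLATFORMS
        if multi_caller or all_platforms:
            kept.add(variant_id)
    return kept


example = [
    ("sv1", "Delly", "HiFi"),
    ("sv1", "Severus", "HiFi"),       # two callers: kept
    ("sv2", "Sniffles2", "HiFi"),
    ("sv2", "Sniffles2", "STD-ONT"),
    ("sv2", "Sniffles2", "UL-ONT"),   # one caller, all three data types: kept
    ("sv3", "Delly", "HiFi"),
    ("sv3", "Delly", "STD-ONT"),      # one caller, two data types: dropped
]
selected = consensus_calls(example)
```

In this toy input, `sv1` and `sv2` pass and `sv3` does not. Real callsets would first need a matching step (for example by breakpoint proximity and variant type) to decide which calls share a `variant_id`.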
Figure 3. Reference comparisons.
A) Percentage of variants from the consensus set manually validated as true positive calls across the different reference genomes. B) Plot showing the number of sSVs mapping to tandem repeat (TR) or segmental duplication (SD) regions of the genome across the three different reference genomes. C) Sequencing data types that called each true positive variant across HiFi and ONT. D) Combinations of sSV callers that called each true positive variant.
Figure 4. Sequence properties and genome-wide distribution of sSVs.
A) Plot showing the number of sSV calls for each reference genome converted to CHM13 coordinates (displayed in Figure 3C,D) compared to the percentage that are unique to that reference genome (not displayed in Figure 3C,D). B) Genomic regions of COLO829BL_DSA variants, shown separately for variants converted to CHM13 coordinates and for those unique to the COLO829BL_DSA. For those unique to the COLO829BL_DSA, a pie chart of the satellite subtypes is shown; the vast majority are alpha satellites. C) Example of a deletion in a COLO829BL_DSA-unique alpha satellite region. D) Ideogram showing the location of sSV calls across the genome for all manually validated sSVs that could be mapped to CHM13 coordinates. Each reference is represented by a different color and shape, and overlap occurs when the shapes intersect. For any variants that overlap genes, the gene name is annotated. E) Venn diagram displaying the number of overlapping variants between the different reference genomes when the sSV calls are mapped to CHM13 coordinates.
Figure 5. COLO829BL_DSA-only sSVs near regulatory regions.
Examples of COLO829BL_DSA-only, manually validated deletions on one haplotype impacting A) OIP5, B) SLCO3A1, and C) SIGLEC7. BL: blood; T: tumor.

References

    Astashyn A, Tvedte ES, Sweeney D, Sapojnikov V, Bouk N, Joukov V, Mozes E, Strope PK, Sylla PM, Wagner L, et al. 2024. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol 25: 60.
    Aydin SK, Yilmaz KC, Acar A. 2025. Benchmarking long-read structural variant calling tools and combinations for detecting somatic variants in cancer genomes. Sci Rep 15: 8707.
    Benson G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573–580.
    Benthami H, Zohair B, Rezouki I, Naji O, Miyara K, Ennachit S, Elkarroumi M, Aschawa H, Badou A. 2025. Elevated Siglec-7 expression correlates with adverse clinicopathological, immunological, and therapeutic response signatures in breast cancer patients. Front Immunol 16: 1573365.
    Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
