Genome Res. 2025 Apr 14;35(4):914-928.
doi: 10.1101/gr.279323.124.

Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease

Tanner D Jensen et al. Genome Res. 2025.

Abstract

Rare structural variants (SVs), including insertions, deletions, and complex rearrangements, can cause Mendelian disease, yet they remain difficult to detect and interpret accurately. We sequenced and analyzed Oxford Nanopore Technologies long-read genomes of 68 individuals from the Undiagnosed Diseases Network (UDN) with no previously identified diagnostic mutations from short-read sequencing. Using our optimized SV detection pipelines and 571 control long-read genomes, we detected 716 long-read rare (MAF < 0.01) SV alleles per genome on average, a 2.4× increase over short reads. To characterize the functional effects of rare SVs, we assessed their relationship with gene expression in blood or fibroblasts from the same individuals and found that rare SVs overlapping enhancers were enriched (log odds ratio, LOR = 0.46) near expression outliers. We also evaluated tandem repeat expansions (TREs) and found 14 rare TREs per genome; notably, these TREs were also enriched near overexpression outliers. To prioritize candidate functional SVs, we developed Watershed-SV, a probabilistic model that integrates expression data with SV-specific genomic annotations and significantly outperforms baseline models that do not incorporate expression data. Watershed-SV identified a median of eight high-confidence functional SVs per UDN genome. Notably, these included compound heterozygous deletions in FAM177A1 shared by two siblings, which were likely causal for a rare neurodevelopmental disorder. Our observations demonstrate the promise of integrating long-read sequencing with gene expression to improve the prioritization of functional SVs and TREs in rare disease patients.

Figures

Figure 1.
Undiagnosed patient cohort description and pipeline overview. (A) Cohort description: patients were recruited from the UDN for a long-read genome sequencing (LR-GS) study. These included 57 affected individuals and 11 unaffected family members from a wide range of primary symptom categories, including neurology, musculoskeletal, and cardiology. Patients had previous short-read genetic testing with Illumina that was negative or inconclusive. (B) Long-read pipeline overview: individuals were sequenced on R9.4 flowcells on the ONT PromethION. Consensus SVs were called by merging SVs across individual callers and keeping those that showed multialgorithm support. A population merge of the UDN genomes with the Stanford ADRC population reference of 579 nanopore genomes allowed ascertainment of robust allele frequencies for SVs. Rare SVs were filtered and intersected with overlapping genome annotations for input into Watershed. Vamos was used on a catalog of polymorphic tandem repeats to genotype tandem repeat copy numbers. A mean neighbor distance-based outlier calling method was used to define extreme repeat expansions. (C) RNA sequencing expression outlier pipeline: transcriptome data from the UDN were processed by quantifying expression, combining with tissue-matched controls from GTEx, normalizing for library size and composition bias, and correcting for batch effects and hidden factors. Expression outliers from the normalized data were input into Watershed. (D) Watershed-SV integrates signals from rare SVs and overlapping genome annotations to predict variants with large functional effects. High-scoring Watershed-SV variants are prioritized and curated per patient for disease relevance.
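The rare-SV filtering step in panel B can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes each merged SV already has one alternate-allele count (0, 1, or 2, or `None` if uncalled) per individual in the combined cohort, and the function names `minor_allele_frequency` and `rare_svs` and the toy SV identifiers are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one SV from diploid alt-allele counts (0, 1, 2).

    Missing calls are given as None and are excluded from the denominator.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_af = sum(called) / (2 * len(called))
    return min(alt_af, 1 - alt_af)


def rare_svs(genotype_table, maf_threshold=0.01):
    """Keep SVs that are observed at least once and have MAF below the threshold."""
    rare = {}
    for sv_id, genotypes in genotype_table.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and 0 < maf < maf_threshold:
            rare[sv_id] = maf
    return rare


# Toy cohort of 100 individuals (hypothetical data, for illustration only).
table = {
    "sv_a": [1] + [0] * 99,       # one heterozygous carrier
    "sv_b": [1] * 10 + [0] * 90,  # ten heterozygous carriers
    "sv_c": [0] * 100,            # never observed
}
print(rare_svs(table))
```

With a reference panel of only a few hundred genomes, a single heterozygous carrier already sits close to the 0.01 cutoff, which is why a larger control set gives more robust frequency estimates.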
Figure 2.
Long-read sequencing detects rare SVs and extreme tandem repeat expansions (TREs). (A) Length distribution of deletions and insertions detected by each technology on a log–log axis. For long reads, SVs were called with a consensus SV calling pipeline including SVIM, cuteSV, and Sniffles2; for short reads, MantaSV calls were genotyped with paragraph. The dashed line represents 50 bp, the threshold for calling an indel an SV. (B) Mean tandem repeat copy numbers estimated from the UDN genomes, stratified by repeat motif length. Short tandem repeats (STRs) have repeat motifs between 2 bp and 6 bp. Variable number tandem repeats (VNTRs) have repeat motifs greater than or equal to 7 bp. Vamos was used to genotype tandem repeat copy number in long reads and ExpansionHunter was used in short reads. Each tool used a different catalog of tandem repeat loci to define TRs. Counts of TRs by repeat motif length bin in each tool's respective catalog are also plotted. (C) Allele frequency distribution of long-read-discovered SVs from the Jasmine-SV merge with ADRC genomes. ADRC provided a reference sample of 600 nanopore genomes to allow robust estimation of minor allele frequencies. (D) Count of rare SVs (MAF < 0.01) detected per individual, stratified by SV type and technology. Short-read SVs were annotated with allele frequencies using SVAFotate and a lookup in gnomAD, CCDG, and 1000G. (E) Count of extreme TREs detected per individual. Extreme TRE outliers in each technology were called by jointly estimating the repeat copy number distribution of long-read vamos calls with the ADRC and of short-read ExpansionHunter calls with 1000G, and then calculating for each allele its average distance from its k-nearest neighbors. Extreme TREs were defined as alleles with a standardized mean neighbor distance (MND) >2, with k = 5 for long reads and k = 25 for short reads.
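The mean neighbor distance rule in panel E can be illustrated with a small sketch. Several details here are assumptions rather than the published method: the MND is standardized as a z-score across the alleles at one locus, only alleles longer than the locus median are reported as expansions, and the function names and toy copy numbers are hypothetical.

```python
from statistics import mean, median, pstdev


def mean_neighbor_distance(copy_numbers, k):
    """For each allele, average the absolute distance to its k nearest other alleles."""
    mnd = []
    for i, x in enumerate(copy_numbers):
        dists = sorted(abs(x - y) for j, y in enumerate(copy_numbers) if j != i)
        mnd.append(mean(dists[:k]))
    return mnd


def extreme_expansions(copy_numbers, k=5, threshold=2.0):
    """Return indices of alleles whose standardized MND exceeds the threshold.

    Assumption: standardization is a z-score over all alleles at the locus,
    and only alleles above the median copy number count as expansions.
    """
    mnd = mean_neighbor_distance(copy_numbers, k)
    mu, sd = mean(mnd), pstdev(mnd)
    if sd == 0:
        return []  # every allele is equally close to its neighbors
    mid = median(copy_numbers)
    return [
        i
        for i, (x, d) in enumerate(zip(copy_numbers, mnd))
        if x > mid and (d - mu) / sd > threshold
    ]


# Toy locus: twelve alleles with 10 to 12 repeat copies and one with 60 (hypothetical).
alleles = [10, 10, 11, 11, 12, 12, 10, 11, 12, 11, 10, 12, 60]
print(extreme_expansions(alleles, k=5))
```

Averaging over k neighbors rather than using the single nearest one keeps a small cluster of moderately long alleles from masking each other, which is one plausible reason for the larger k used with the larger short-read reference.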
Figure 3.
Rare long-read-discovered SVs are strongly enriched proximal to gene expression outliers. (A) Enrichment of rare SVs, stratified by type, within 10 kb of an expression outlier gene at the specified absolute Z-score threshold. Estimates of the log odds ratio are plotted with error bars representing standard errors of the estimate. (B) Directional enrichment for rare SVs within 10 kb of either overexpression (Z > 4) or underexpression (Z < −4) outliers. TRE-underexpression enrichment is not plotted because of an insufficient number of rare TREs near underexpression outliers. (C) Enrichment of rare SVs, across all SV types, within 100 kb of expression outliers, stratified by genome and variant annotation categories. Gene body position displays enrichment of VEP-annotated categories for SV location relative to the gene body of the expressed gene. If an SV overlaps multiple categories, it is assigned to the one with highest priority given the following ordering: CDS, 5′ UTR, 3′ UTR, intron, upstream noncoding, downstream noncoding. SV length and CADDSV deleteriousness display enrichment of rare SVs with length and CADDSV score, respectively, above the specified threshold. VEP impact displays enrichment of rare SVs with the given VEP impact category, where HIGH represents predicted loss-of-function variants. Finally, we display enrichment of SVs that overlap noncoding regulatory annotations: an ABC regulatory element linked to the expressed gene, a conserved transcription factor binding site (TFBS), a high density of ChIP-seq peaks defining conserved regulatory modules (CRM) from ReMap, a TAD boundary detected in multiple cell types, highly constrained LINSIGHT SNVs, or a region highly conserved by phastCons. We also display enrichments for SVs that overlap any one of these annotations (putative regulatory SVs) and for SVs that do not overlap any of these annotations (putative nonregulatory).
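For readers unfamiliar with the enrichment statistic, a log odds ratio and its standard error can be computed from a 2×2 table of genes. The sketch below uses the Woolf (log) method with a 0.5 continuity correction when a cell is zero; this is an assumption made for illustration, since the caption does not state how the estimates were obtained (a logistic regression would be another common choice). The function name and counts are hypothetical.

```python
import math


def log_odds_ratio(outlier_with_sv, outlier_without_sv,
                   nonoutlier_with_sv, nonoutlier_without_sv):
    """Return (log odds ratio, standard error) for a 2x2 enrichment table.

    Rows: expression outlier vs. non-outlier genes.
    Columns: rare SV nearby vs. no rare SV nearby.
    """
    cells = [outlier_with_sv, outlier_without_sv,
             nonoutlier_with_sv, nonoutlier_without_sv]
    if any(c == 0 for c in cells):
        # Haldane-Anscombe correction keeps the estimate finite.
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se


# Hypothetical counts, for illustration only.
lor, se = log_odds_ratio(20, 80, 10, 90)
print(f"LOR = {lor:.3f} +/- {se:.3f}")
```

A positive log odds ratio means rare SVs are found near outlier genes more often than near non-outlier genes; a value of 0 means no enrichment.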
Figure 4.
Watershed-SV improves the prioritization of rare SVs in healthy and muscular dystrophy cohorts. (A) Precision-recall curves (PRC) of the benchmark using held-out N2 pairs. We ran multitissue Watershed-SV using both 10 kb (solid) and 100 kb (dashed) distance limits, as well as the WGS-only model and CADD-SV with the same setup. (B) Top positive genomic annotation effect sizes (β) for seven major categories of the 10 kb multitissue Watershed-SV model. (C) Using Z-score thresholds of −3 and 3, we stratified the posterior probabilities predicted by the 100 kb multitissue Watershed-SV model on the CMG muscular disorder data set by under-, over-, and nonoutliers (column), and then by coding versus noncoding variants (row); each dot represents a gene-SV pair. (D) Top positive genomic annotation effect sizes for the 100 kb multitissue Watershed-SV model. Seven annotation categories are grouped into region-specific (TSS/upstream flank, gene body, TES/downstream flank) and region-agnostic features. Region-specific features are separately aggregated for each SV, then collapsed to each gene by region.
Figure 5.
Watershed-SV prioritizes symptom-relevant functional rare SVs from the UDN LR-GS data set. (A) Swarm plot of the number of gene-SV pairs prioritized per individual in the UDN LR-GS data set under different sets of combined filters. There are four filter categories: LR-GS-only filters, LR-GS + HPO filters, LR-GS + RNA filters, and LR-GS + RNA + HPO filters, in increasing order of stringency as more filter types are jointly applied. The red dot represents the mean number of gene-SV pairs across individuals, and the red horizontal line represents the standard deviation; the x-axis is in log2 scale; the bar plot on the right shows the number of samples with significant prioritizations. (B) UpSet plot depicting the number of gene-SV pairs prioritized by Watershed-SV (posterior > 0.6) and by CADD-SV (score > 10), and whether the SV is uniquely identified using LR-GS. (C,E) Case example 1, rare TREs shared by both siblings, and case example 2, rare compound heterozygous deletions in siblings. Lollipop plots show which sets of filters include the candidate diagnostic gene-SV pair (triangle) and which do not (circle); the height of each lollipop represents the number of gene-SV pairs prioritized, in log2 scale. (D) Panels depict the TR copy numbers of the siblings and of the unaffected parent with a less-expanded allele. The TRE locus is in the 5′ UTR of FAM193B. Both Watershed-SV and CADD-SV prioritize this variant, but the WGS-only baseline model does not. Both siblings have extremely high overexpression Z-scores. (F) Panels depict the compound heterozygous deletions in FAM177A1, phased onto both alleles, which cause loss of function (LOF) of the gene and thereby underexpression outliers. Only Watershed-SV succeeded at prioritizing both variants.
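The set membership counted in the UpSet plot of panel B can be sketched as follows. The cutoffs (posterior > 0.6, CADD-SV score > 10) come from the caption; everything else is an assumption for illustration, including the `GeneSVPair` record, its field names, and the placeholder gene and SV identifiers, none of which are taken from the study's data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneSVPair:
    gene: str
    sv_id: str
    watershed_posterior: float
    caddsv_score: float
    long_read_only: bool  # SV not found in the short-read call set


def upset_counts(pairs, posterior_cutoff=0.6, cadd_cutoff=10.0):
    """Count gene-SV pairs by (Watershed-SV hit, CADD-SV hit, LR-only) membership.

    Pairs prioritized by neither method are skipped.
    """
    counts = Counter()
    for p in pairs:
        by_watershed = p.watershed_posterior > posterior_cutoff
        by_cadd = p.caddsv_score > cadd_cutoff
        if by_watershed or by_cadd:
            counts[(by_watershed, by_cadd, p.long_read_only)] += 1
    return counts


# Placeholder records (hypothetical values, not results from the paper).
pairs = [
    GeneSVPair("GENE_A", "sv1", 0.92, 14.0, True),
    GeneSVPair("GENE_B", "sv2", 0.75, 3.0, True),
    GeneSVPair("GENE_C", "sv3", 0.10, 22.0, False),
    GeneSVPair("GENE_D", "sv4", 0.05, 1.0, True),
]
for membership, n in sorted(upset_counts(pairs).items(), reverse=True):
    print(membership, n)
```

Each key is one column of an UpSet plot: the tuple records which of the three sets the pair belongs to, and the count is the bar height for that intersection.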
