Genome Med. 2024 Oct 15;16(1):119.
doi: 10.1186/s13073-024-01383-8.

A validated heart-specific model for splice-disrupting variants in childhood heart disease

Robert Lesurf et al. Genome Med.

Abstract

Background: Congenital heart disease (CHD) is the most common congenital anomaly. Almost 90% of isolated cases have an unexplained genetic etiology after clinical testing. Non-canonical splice variants that disrupt mRNA splicing through the loss or creation of exon boundaries are not routinely captured and/or evaluated by standard clinical genetic tests. Recent computational algorithms such as SpliceAI have shown an ability to predict such variants, but are not specific to cardiac-expressed genes and transcriptional isoforms.

Methods: We used genome sequencing (GS) (n = 1101 CHD probands) and myocardial RNA-Sequencing (RNA-Seq) (n = 154 CHD and n = 43 cardiomyopathy probands) to identify and validate splice-disrupting variants, and to develop a heart-specific model for canonical and non-canonical splice variants that can be applied to patients with CHD and cardiomyopathy. Two thousand five hundred seventy GS samples from the Medical Genome Reference Bank were analyzed as healthy controls.

Results: Of 8583 rare DNA splice-disrupting variants initially identified using SpliceAI, 100 were associated with altered splice junctions in the corresponding patient myocardium, affecting 95 genes. Using strength of myocardial gene expression and genome-wide DNA variant features that were confirmed to affect splicing in myocardial RNA, we trained a machine learning model for predicting cardiac-specific splice-disrupting variants (AUC 0.86 on internal validation). In a validation set of 48 CHD probands, the cardiac-specific model outperformed a SpliceAI model alone (AUC 0.94 vs 0.67, respectively). Application of this model to an additional 947 CHD probands with only GS data identified 1% of patients with canonical and 11% of patients with non-canonical splice-disrupting variants in CHD genes. Forty-nine percent of predicted splice-disrupting variants were intronic and > 10 bp from existing splice junctions. The burden of high-confidence splice-disrupting variants in CHD genes was 1.28-fold higher in CHD cases compared with healthy controls.
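
The AUC values quoted above summarize how well a model ranks variants confirmed to affect splicing above unconfirmed ones. As a minimal sketch of that metric only (not the authors' pipeline), the rank-based form of AUC can be computed as below. The function name `roc_auc` and the toy labels and scores are hypothetical and were chosen for illustration; they are not data from the study.

```python
def roc_auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC.

    Returns the fraction of (positive, negative) pairs in which the
    positive item has the higher score; tied pairs count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = variant confirmed to affect splicing, 0 = not confirmed.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical scores from two imagined models, for comparison only.
scores_model_a = [0.90, 0.80, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]
scores_model_b = [0.60, 0.20, 0.50, 0.70, 0.30, 0.40, 0.10, 0.55]

auc_a = roc_auc(labels, scores_model_a)
auc_b = roc_auc(labels, scores_model_b)
print(f"model A AUC = {auc_a:.2f}, model B AUC = {auc_b:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.94 vs 0.67 comparison should be read.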

Conclusions: A new cardiac-specific in silico model was developed using complementary GS and RNA-Seq data that improved genetic yield by identifying a significant burden of non-canonical splice variants associated with CHD that would not be detectable through panel or exome sequencing.

Keywords: Congenital Heart Disease; Genomics; Machine Learning; Non-canonical; RNA splicing.

Conflict of interest statement

SM is on the Advisory Board of Bristol Myers Squibb, Rocket Pharmaceuticals, and Tenaya Therapeutics. The remaining authors declare that they have no competing interests.

Figures

Fig. 1
Schematic workflow for development, validation, and application of a random forest model for selecting high-confidence splice-disrupting variants for congenital heart disease. Selection strategy is shown for the identification of splice-disrupting variants in CHD genes. Model development: CHD Discovery cohort (n = 106) was used to identify putative splice-disrupting variants in genome sequencing (GS) data and confirm whether the variants were associated with a significant effect in RNA-sequencing (RNA-Seq) data derived from patient myocardium. These variants and their confirmed effect were then used to construct random forest models for predicting splice-disrupting variants with high confidence. Model validation: Model performance was validated using independent CHD validation (n = 48) and cardiomyopathy validation (n = 43) cohorts, where both GS and RNA-Seq profiles were available for all probands. Model application: The optimal random forest model was applied to a CHD Extension cohort (n = 947), where only GS data were available for all probands. One hundred thirty-two (12%) CHD probands harbored 133 rare, high-confidence splice-disrupting variants in CHD genes, including 47 variants in Tier 1 CHD genes and 86 variants in Tier 2 haploinsufficiency-intolerant CHD genes. RNA-Seq, RNA sequencing; GS, genome sequencing; FDR, false discovery rate; MAF, minor allele frequency; CHD, congenital heart disease
Fig. 2
Correlation matrix for DNA variant features used in model development. The matrix shows minimal correlation between DNA variant input features used in developing random forest models in the Discovery cohort
Fig. 3
Performance of random forest models for splice-disrupting variants on internal cross-validation. Four types of models were each designed using either class weights or SMOTE to address class imbalance; internal performance was assessed using five-fold cross-validation to compare the area under the curve (AUC) for each model. Weighted model performance: a SpliceAI only AUC, b DNA variant features only AUC, c DNA variant features + myocardial RNA gene expression AUC, d SpliceAI + DNA variant features + myocardial RNA gene expression AUC. e Gini coefficient showing the importance of a specific feature to the nodes and leaves of the random forest model 4. f The odds ratio for selecting variants confirmed to affect splicing was highest for model 4. SMOTE model performance: g SpliceAI only AUC, h DNA variant features only AUC, i DNA variant features + myocardial RNA gene expression AUC, j SpliceAI + DNA variant features + myocardial RNA gene expression AUC. k Gini coefficient showing the importance of a specific feature to the nodes and leaves of the random forest model 4. l The odds ratio for selecting variants confirmed to affect splicing was highest for model 4
Fig. 4
Performance of weighted random forest model for splice-disrupting variants applied to external validation cohorts. The performance of the weighted model was assessed in two external validation cohorts. CHD validation cohort: a SpliceAI only AUC, b DNA variant features only AUC, c DNA variant features + myocardial RNA gene expression AUC, d SpliceAI + DNA variant features + myocardial RNA gene expression AUC. AUC was highest for model 4 in CHD cohort (n = 48). e The odds ratio for selecting variants confirmed to affect splicing was highest for model 4 in CHD cohort. Cardiomyopathy validation cohort: f SpliceAI only AUC, g DNA variant features only AUC, h DNA variant features + myocardial RNA gene expression AUC, i SpliceAI + DNA variant features + myocardial RNA gene expression AUC. AUC was highest for model 4 in cardiomyopathy cohort (n = 43). j The odds ratio for selecting variants confirmed to affect splicing was highest for model 4 in cardiomyopathy cohort
Fig. 5
Frequency of high-confidence splice-disrupting variants in CHD genes. One hundred thirty-three confirmed and high-confidence splice-disrupting variants in CHD genes were identified in the 1101 CHD patients: Discovery (n = 106), Validation (n = 48), and Extension (n = 947) cohorts. Variants were mapped to their closest annotated wild-type splice site within their corresponding gene. Canonical splice regions are highlighted in gray. a Variant position: Intronic variants > 10 bp from a splice junction accounted for 49% of all variants. b SpliceAI Δ variant scores: Splice-disrupting variants showed high variability in SpliceAI scores. Putatively disease-causing splice-disrupting variants in Tier 1 CHD genes were found in c 4% of TOF probands, and d 5% of TGA probands without an explained genetic etiology, with non-canonical variants representing a majority of splice-disrupting variants. TOF, tetralogy of Fallot; TGA, transposition of the great arteries; SNV, single-nucleotide variant; indel, insertion-deletion
Fig. 6
Representative splice-disrupting variants in CHD genes. Family pedigrees with CHD harboring representative high-confidence splice-disrupting variants in Tier 1 and 2 genes are shown. a TBX20 (Tier 1), b NOTCH1 (Tier 1), c CGNL1 (Tier 2), d CHD7 (Tier 1), e EFTUD2 (Tier 1), and f ACTB (Tier 2). Wild-type exon/intron boundaries below IGV screenshots of RNA-Seq data are represented in black, and alternatively observed boundaries are represented in red. FRASER statistics for outlier splicing events are written below the alternative boundaries. Arrows next to gene names represent reading direction. Purple arrows represent location of DNA splice-disrupting variant. TOF, tetralogy of Fallot; ECA, extracardiac anomalies
Fig. 7
Gene sets enriched for splice-disrupting variants in CHD genes in the CHD Discovery, Validation, and Extension cohorts (n = 1101). Graph showing significantly enriched Human Phenotype Ontology (HP) terms among genes affected by high-confidence splice-disrupting variants in CHD genes
Fig. 8
Altered splicing events in CHD genes in myocardium without an identified DNA variant. Family pedigrees with CHD showing representative aberrant splicing events in CHD genes without corresponding DNA variants are shown. a MAP2K1 (Tier 1), b ACTB (Tier 2), and c FBN2 (Tier 2). IGV screenshots of RNA-Seq data for all samples are shown. TOF, tetralogy of Fallot; ECA, extracardiac anomalies
Fig. 9
CHD case–control burden of splice-disrupting variants. The burden of variants was compared between 947 CHD cases vs 2570 healthy controls. Synonymous variants: The per-sample allele frequency of synonymous variants is shown for a all samples and b the subset of samples with European genetic ancestry. A median of 41 and 40.5 synonymous variant alleles were found in cases and controls, respectively (p > 0.05), indicating that the two cohorts were directly comparable. Splice-disrupting variants: High-confidence splice-disrupting variants found in cases and controls were limited to those selected by weighted model 4, and annotated according to the intragenic region in which they were located. Odds ratios and 95% confidence intervals are shown comparing variant burden in CHD cases and controls for c all samples and d the subset of samples with European genetic ancestry. Splice-disrupting variants were predominantly enriched in CHD genes, especially Tier 1 CHD genes. CHD, congenital heart disease; pLI, probability of being loss-of-function intolerant
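
The case-control comparison in Fig. 9 is reported as odds ratios with 95% confidence intervals. The sketch below shows one common way such a figure can be computed from a 2x2 table, using the log (Woolf) method for the interval. It is an illustration under stated assumptions, not the authors' analysis: the cohort sizes (947 cases, 2570 controls) come from the caption, while the carrier counts (60 and 120), the function name `odds_ratio_ci`, and the choice of the Woolf interval are hypothetical.

```python
import math


def odds_ratio_ci(case_carriers, case_total, control_carriers, control_total, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    z = 1.96 gives an approximate 95% interval. All four cells of the
    2x2 table must be positive for the log-scale standard error to exist.
    """
    a = case_carriers                      # cases carrying a variant
    b = case_total - case_carriers         # cases without a variant
    c = control_carriers                   # controls carrying a variant
    d = control_total - control_carriers   # controls without a variant
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Cohort sizes are from the caption; carrier counts are hypothetical placeholders.
or_value, ci_low, ci_high = odds_ratio_ci(
    case_carriers=60, case_total=947,
    control_carriers=120, control_total=2570,
)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An interval that excludes 1 indicates enrichment (or depletion) in cases at roughly the 5% level; the study's own per-region counts would be needed to reproduce the published values.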
