Genome Med. 2021 Feb 22;13(1):31.
doi: 10.1186/s13073-021-00835-9.

CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores

Philipp Rentzsch et al. Genome Med.

Abstract

Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies.

Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants.

Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance.

Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Benchmarking available splice predictions on the MFASS data set. We use the Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) data set to benchmark available splice effect predictors. MFASS studied the splicing effects of more than 27,000 human exonic and intronic variants by placing a synthetic library of the respective exons (or the nearest exon for intronic variants) between two GFP exons. The genome-integrated sequences are transcribed, and RNA-seq is used to measure how often each exon is spliced into or out of the reporter mRNAs. Changes in percent spliced-in (psi) between the reference and alternative alleles identify splice-disrupting variants (sdv). We analyze how well different machine learning models distinguish sdv from non-sdv variants.
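The labeling step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Variant` type, the variant names, the psi values, and the cutoff of -0.5 are all assumptions made for the example, since the text here does not state the threshold used.

```python
from dataclasses import dataclass

# Assumption: illustrative cutoff on the change in psi; the caption does not give one.
SDV_DELTA_PSI_THRESHOLD = -0.5


@dataclass
class Variant:
    """Hypothetical record holding psi for the reference and alternative allele."""
    name: str
    psi_ref: float
    psi_alt: float


def delta_psi(variant: Variant) -> float:
    """Change in percent spliced-in caused by the alternative allele."""
    return variant.psi_alt - variant.psi_ref


def is_sdv(variant: Variant, threshold: float = SDV_DELTA_PSI_THRESHOLD) -> bool:
    """Call a variant splice-disrupting if exon inclusion drops by at least |threshold|."""
    return delta_psi(variant) <= threshold


# Toy data, invented for the example.
variants = [
    Variant("var_a", psi_ref=0.95, psi_alt=0.20),
    Variant("var_b", psi_ref=0.90, psi_alt=0.85),
]
labels = {v.name: is_sdv(v) for v in variants}
```

The resulting boolean labels are the kind of sdv versus non-sdv classes that a predictor's scores would then be compared against.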
Fig. 2
Precision-recall performance of classifying intronic and exonic MFASS variants. Different machine learning models were used to separate splice-disrupting variants from those without a splice effect. Shown are (a) all MFASS variants that were scored by all splice effect predictors, (b) only exonic variants, and (c) only intronic variants. Generally, specialized splice effect predictors, such as MMSplice, SPANR, and SpliceAI, perform better than the more general CADD on both exonic and intronic variants. We observe the best performance when combining MMSplice and SpliceAI with the percent spliced-in (psi) value of the reference allele in a linear combination (MMAIpsi). Such a model, however, is assay-specific and circular with respect to the MFASS class definitions. A new CADD-Splice model, integrating MMSplice and SpliceAI as features, outperforms previous CADD models.
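For readers unfamiliar with the metric in this figure, a precision-recall curve can be computed from scores and binary labels roughly as below. This is a simplified sketch under stated assumptions: higher scores are taken to mean "more likely splice-disrupting", tied scores are not grouped, and the scores and labels are toy values rather than MFASS data.

```python
def precision_recall_points(scores, labels):
    """Return (precision, recall) after each variant, ranked by descending score.

    Assumes higher score = more likely positive; tied scores are not merged.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        true_positives += int(is_positive)
        points.append((true_positives / rank, true_positives / total_positives))
    return points


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    precisions = []
    true_positives = 0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            true_positives += 1
            precisions.append(true_positives / rank)
    if not precisions:
        raise ValueError("at least one positive label is required")
    return sum(precisions) / len(precisions)


# Toy example: 1 = splice-disrupting, 0 = no splice effect.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
pr_points = precision_recall_points(toy_scores, toy_labels)
ap = average_precision(toy_scores, toy_labels)
```

Precision-recall is a common choice when positives are much rarer than negatives, because precision is sensitive to false positives in a way that ROC-based summaries are not.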
Fig. 3
Increased enrichment for rare variants at high CADD scores. CADD assigns lower scores with increasing population frequency, despite allele frequency not being included in the model. Here, depletion and enrichment of variants are grouped by frequency and CADD score percentiles, with CADD-Splice outperforming previous versions. At high CADD scores, frequent (MAF > 0.001) and rare (allele count > 1) variants are depleted and singletons (observed once in gnomAD) are enriched. For variants in canonical splice sites (left), the difference is mostly within the bootstrapped 95% confidence interval, but CADD-Splice significantly outperforms previous versions within 20 bp of splice sites (right).
Fig. 4
Improved performance of CADD for separating common and known pathogenic variants. The CADD-Splice model has a higher auROC than previous CADD versions and specialized splice scores when separating pathogenic ClinVar variants from common gnomAD variants (MAF > 0.05), for both splice site variants (left) and intronic variants (right).
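The auROC reported in Fig. 4 can be read as the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen common variant. A minimal sketch of that pairwise definition is shown below; the function name and the score values are invented for illustration and are not taken from ClinVar, gnomAD, or CADD output.

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    scores higher; ties count as half. Quadratic in the number of variants,
    so suitable only for small illustrative inputs.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores, assumed for the example (higher = predicted more deleterious).
pathogenic_scores = [30.0, 25.0, 12.0]
common_scores = [10.0, 12.0, 3.0]
toy_auroc = auroc(pathogenic_scores, common_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.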

