Genome Med. 2021 Feb 22;13(1):31.

doi: 10.1186/s13073-021-00835-9.

CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores

Philipp Rentzsch¹,², Max Schubach¹,², Jay Shendure³,⁴, Martin Kircher⁵,⁶

Affiliations

¹ Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany.
² Berlin Institute of Health (BIH), 10178, Berlin, Germany.
³ Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, 98195, USA.
⁴ Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA.
⁵ Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany. martin.kircher@bihealth.de.
⁶ Berlin Institute of Health (BIH), 10178, Berlin, Germany. martin.kircher@bihealth.de.

PMID: 33618777
PMCID: PMC7901104
DOI: 10.1186/s13073-021-00835-9

Philipp Rentzsch et al. Genome Med. 2021.

Abstract

Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies.

Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants.

Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance.

Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Benchmarking available splice predictions on the MFASS data set. We use the Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) data set to benchmark different available splice effect predictors. MFASS studied splicing effects of more than 27,000 human exonic and intronic variants by creating a synthetic library of the respective exons (or the nearest exon for intronic variants) between two GFP exons. The genome-integrated sequences are transcribed, and RNA-seq is used to measure how much each exon is spliced into or out of the reporter mRNAs. Changes in the percent spliced-in (psi) between reference and alternative sequence alleles are used to identify splice-disrupting variants (sdv). We analyze how well different machine learning models distinguish between sdv and no-sdv variants.
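The labeling step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the example psi values, and the -0.5 cutoff on the psi change are assumptions, since the caption does not state the threshold.

```python
def delta_psi(psi_ref: float, psi_alt: float) -> float:
    """Change in percent spliced-in between alternative and reference allele."""
    return psi_alt - psi_ref


def is_sdv(psi_ref: float, psi_alt: float, threshold: float = -0.5) -> bool:
    """Label a variant as splice disrupting if psi drops by at least |threshold|.

    The default threshold is an assumed value for illustration only.
    """
    return delta_psi(psi_ref, psi_alt) <= threshold


# Hypothetical variants: (psi of reference allele, psi of alternative allele)
variants = {"var_a": (0.95, 0.30), "var_b": (0.90, 0.85)}
labels = {name: is_sdv(ref, alt) for name, (ref, alt) in variants.items()}
print(labels)
```

The resulting boolean labels are the classes (sdv versus no-sdv) that a splice effect predictor would then be asked to separate.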

**Fig. 2**
Precision-Recall performance of classifying intronic and exonic MFASS variants. Different machine learning models were used to separate splice-disrupting variants from those without a splice effect. Shown are (a) all variants in MFASS that were scored by all splice effect predictors, (b) only exonic variants, and (c) only intronic variants. Generally, specialized splice effect predictors, such as MMSplice, SPANR, and SpliceAI, perform better than the more general CADD, on both exonic and intronic variants. We observe the best performance by combining MMSplice and SpliceAI with the percent spliced-in (psi) value of the reference allele in a linear combination (MMAIpsi). Such a model, however, is assay-specific and circular with MFASS class definitions. A new CADD-Splice model, integrating MMSplice and SpliceAI as features, outperforms previous CADD models.
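For readers unfamiliar with the metric, a precision-recall curve of the kind shown in this figure can be computed from predictor scores and binary sdv labels roughly as below. This is a generic sketch with made-up scores; the function name and data are hypothetical and do not reproduce the paper's evaluation code.

```python
def precision_recall_points(scores, labels):
    """Return (precision, recall) pairs, one per distinct score cutoff.

    Variants are ranked from highest to lowest score; at each cutoff,
    everything ranked at or above it is predicted to be splice disrupting.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all variants sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / (tp + fp), tp / total_pos))
    return points


# Hypothetical predictor scores and sdv labels (1 = sdv, 0 = no-sdv)
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
points = precision_recall_points(scores, labels)
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision-recall is a natural choice here because sdv variants are a minority class, so a curve based on false positive rate alone could look deceptively good.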

**Fig. 3**
Increased enrichment for rare variants at high CADD scores. CADD assigns higher scores with decreasing population frequency, despite allele frequency not being included in the model. Here, depletion and enrichment of variants are grouped by frequency and CADD score percentiles, with CADD-Splice outperforming previous versions. At high CADD scores, frequent (MAF > 0.001) and rare (allele count > 1) variants are depleted and singletons (observed once in gnomAD) enriched. For variants in canonical splice sites (left), the difference is mostly within the bootstrapped 95% confidence interval, but CADD-Splice significantly outperforms previous versions within 20 bp of splice sites (right).
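One simple way to express the enrichment and depletion described here is the ratio between a frequency class's share among high-scoring variants and its share among all variants. The sketch below assumes this definition for illustration; the caption does not spell out the exact statistic, and the function name, cutoff, and data are hypothetical.

```python
from collections import Counter


def enrichment_by_class(variants, score_cutoff):
    """Ratio of each class's share above the cutoff to its overall share.

    variants: list of (score, frequency_class) tuples.
    A ratio above 1 indicates enrichment, below 1 depletion.
    """
    overall = Counter(cls for _, cls in variants)
    high = Counter(cls for score, cls in variants if score >= score_cutoff)
    n_all = len(variants)
    n_high = sum(high.values())
    if n_high == 0:
        raise ValueError("no variants at or above the score cutoff")
    return {
        cls: (high[cls] / n_high) / (overall[cls] / n_all)
        for cls in overall
    }


# Hypothetical (score, frequency class) pairs
variants = [
    (30, "singleton"), (28, "singleton"), (25, "rare"), (10, "singleton"),
    (8, "rare"), (5, "frequent"), (4, "frequent"), (2, "rare"),
]
ratios = enrichment_by_class(variants, score_cutoff=20)
print(ratios)
```

In this toy data, singletons come out enriched and frequent variants depleted among high scores, which is the qualitative pattern the figure reports.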

**Fig. 4**
Improved performance of CADD for separating common and known pathogenic variants. The CADD-Splice model has a higher auROC than previous CADD versions and specialized splice scores in distinguishing pathogenic variants from ClinVar from common variants (MAF > 0.05) from gnomAD, for both splice site variants (left) and intronic variants (right).
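The auROC used in this comparison equals the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen common variant, with ties counted as one half. A minimal sketch of that pairwise definition, using invented scores and a hypothetical function name, is:

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive scores higher; tied pairs contribute 0.5.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for pathogenic and common variants
pathogenic = [25.0, 18.0, 30.0]
common = [3.0, 18.0, 7.0]
value = auroc(pathogenic, common)
print(round(value, 3))
```

The quadratic loop is fine for a small illustration; for data sets of realistic size, a rank-based computation or a library routine would be the more practical choice.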

Grants and funding

R01 CA197139/CA/NCI NIH HHS/United States
