Genome Biol. 2022 Apr 21;23(1):103.
doi: 10.1186/s13059-022-02664-4.

Predicting RNA splicing from DNA sequence using Pangolin

Tony Zeng et al. Genome Biol.

Abstract

Recent progress in deep learning has greatly improved the prediction of RNA splicing from DNA sequence. Here, we present Pangolin, a deep learning model to predict splice site strength in multiple tissues. Pangolin outperforms state-of-the-art methods for predicting RNA splicing on a variety of prediction tasks. Pangolin improves prediction of the impact of genetic variants on RNA splicing, including common, rare, and lineage-specific genetic variation. In addition, Pangolin identifies loss-of-function mutations with high precision and recall, particularly for mutations that are not missense or nonsense, demonstrating remarkable potential for identifying pathogenic variants.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Overview of Pangolin and evaluation. a Schematic and architecture of Pangolin. b Heatmap summarizing the performance of Pangolin, SpliceAI, HAL, MMSplice, and MaxEntScan with respect to three metrics including top-1 accuracy. c Precision-recall curves representing the precision and recall from multiple methods for the prediction of splice-disrupting variants as identified in Cheung et al. [8] (1050 splice-disrupting variants out of 27,733 total). d Scatter plots showing measured versus predicted effects of single genetic variants (left) or a combination of genetic variants (right) on RNA splicing. Measured effects of single genetic variants and combinations of variants were obtained from Julien et al. [15] and Baeza-Centurion et al. [3] respectively. e In silico mutagenesis of 6416 exons from human chromosomes 7 and 8. Barplots show for each base the percent of mutations (square root) predicted to increase or decrease usage by at least 0.2
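The in silico mutagenesis in panel e can be illustrated with a minimal sketch. It substitutes every base of a sequence with each of the three alternatives, rescores the mutant, and reports per position the fraction of substitutions that shift the score by at least 0.2 in either direction. The scorer here (`toy_score`) is a hypothetical placeholder that only checks for a GT dinucleotide at a fixed position; it is not Pangolin, and the function names and example sequence are assumptions made for illustration.

```python
BASES = "ACGT"


def toy_score(seq, donor_pos):
    """Hypothetical stand-in for a splicing model (NOT Pangolin):
    returns a high score only when 'GT' sits at donor_pos."""
    return 0.9 if seq[donor_pos:donor_pos + 2] == "GT" else 0.1


def in_silico_mutagenesis(seq, score_fn, threshold=0.2):
    """For each position, return (index, fraction of substitutions that
    increase the score by >= threshold, fraction that decrease it by >= threshold)."""
    ref_score = score_fn(seq)
    results = []
    for i, ref_base in enumerate(seq):
        increased = decreased = 0
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the reference base; only true substitutions count
            mutant = seq[:i] + alt + seq[i + 1:]
            delta = score_fn(mutant) - ref_score
            if delta >= threshold:
                increased += 1
            elif delta <= -threshold:
                decreased += 1
        # three possible substitutions per position
        results.append((i, increased / 3, decreased / 3))
    return results


if __name__ == "__main__":
    # Assumed example sequence with 'GT' at index 3
    for row in in_silico_mutagenesis("CAGGTAAGT", lambda s: toy_score(s, 3)):
        print(row)
```

With this toy scorer, only substitutions at the two GT positions change the score, so only those positions show a nonzero fraction of score-decreasing mutations. A real analysis would replace `toy_score` with model predictions of splice site usage.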
Fig. 2
Application of Pangolin to a variety of prediction tasks. a Cumulative density plot of the log10 sQTL p-value fold difference between the SNP predicted to affect splicing and that of the lead sQTL SNP for the top 500 sQTLs identified in DGN (All predictions), or for the 100 predictions with the largest predicted effects (inset). b Example of a splice site that shows a large inter-species difference in usage. A single-nucleotide difference between chimp (T) and human (C) is predicted to strongly decrease (resp. increase) usage of a chimp (resp. human) splice site (dashed vertical line indicates the human site). The T (resp. C) difference likely disrupts (resp. creates) a 3’ canonical splice site in chimp (resp. human). c Locations and effects of SNVs ±50 bp from a splice site predicted to underlie inter-species differences in splice site usage for 71 3’ and 74 5’ sites. A large fraction, but not all, of splice-altering variants are located near the canonical splice sites. d Survival function plots of BRCA1 variants in splice regions as a function of their predicted effects on splicing. The variants are separated by their classification as loss-of-function (LOF, blue), intermediate effect (INT, orange), or functional (FUNC, green). We observe a strong enrichment of LOF variants among variants with large predicted splicing effects. e Precision-recall curves for different variant types representing the precision and recall for distinguishing LOF variants from functional variants. Pangolin achieves a high AUPRC for variants in extended splice regions (note that this excludes canonical splice variants). See Additional file 1: Fig. S8 for variants from additional annotation bins. f Predicted splicing effects of mutations in or flanking 4 BRCA1 exons from Findlay et al. [12]. Mutations identified to be LOF or to have intermediate phenotypes, as well as missense, nonsense, and canonical splice site mutations are annotated. See Additional file 1: Fig. S9 for all 13 exons with predictions. g Precision-recall curves representing the precision and recall for distinguishing variants annotated as pathogenic from variants annotated as benign in ClinVar. The blue (resp. orange) line represents the PRC for variants excluding (resp. including) variants in annotated splice sites. Missense and nonsense variants are excluded.
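As a reading aid for the precision-recall curves and AUPRC values in these captions, the sketch below computes precision-recall points and average precision (one common estimator of AUPRC) from ranked scores. The scores and labels are invented toy data, ties between scores are not treated specially, and the abstract page does not state how the authors computed AUPRC, so this should not be read as their procedure. It also shows the precision expected from a random ranking for the Fig. 1c set (1050 splice-disrupting variants out of 27,733), which equals the positive fraction.

```python
def precision_recall_points(scores, labels):
    """Sweep a threshold from the highest score down and return
    a list of (recall, precision) pairs, one per ranked item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    points = []
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / total_pos, true_pos / rank))
    return points


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    ap_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            ap_sum += true_pos / rank
    return ap_sum / total_pos


if __name__ == "__main__":
    # Toy data (assumed): predicted effect sizes and 1 = splice-disrupting
    scores = [0.9, 0.8, 0.7, 0.4, 0.2]
    labels = [1, 0, 1, 0, 0]
    print(precision_recall_points(scores, labels))
    print(average_precision(scores, labels))

    # Random-ranking precision for the Fig. 1c variant set
    print(1050 / 27733)
```

Because the positive class is rare in sets like this, an AUPRC is best judged against the positive fraction rather than against 0.5.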

References

    1. Aguet F, Anand S, Ardlie KG, Gabriel S, Getz GA, Graubert A, Hadley K, Handsaker RE, Huang KH, Kashin S, et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.
    2. Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53(3):354–66.
    3. Baeza-Centurion P, Miñana B, Schmiedel JM, Valcárcel J, Lehner B. Combinatorial Genetics Reveals a Scaling Law for the Effects of Mutations on Splicing. Cell. 2019;176(3):549–63.
    4. Blencowe BJ. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends Biochem Sci. 2000;25(3):106–10.
    5. Cardoso-Moreira M, Halbert J, Valloton D, Velten B, Chen C, Shao Y, Liechti A, Ascenção K, Rummel C, Ovchinnikova S, et al. Gene expression across mammalian organ development. Nature. 2019;571(7766):505–9.
