Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach

Yi Zhang¹, Xinan Liu², James MacLeod³, Jinze Liu²

Affiliations

¹ Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA. yi.zhang@uky.edu.
² Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA.
³ Department of Veterinary Science, University of Kentucky, Lexington, KY, 40506, USA.

PMID: 30591034
PMCID: PMC6307148
DOI: 10.1186/s12864-018-5350-1

Yi Zhang et al. BMC Genomics. 2018 Dec 27;19(1):971. doi: 10.1186/s12864-018-5350-1.

Abstract

Background: Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, they introduce a substantial number of false positive exon junction predictions, which confound downstream analyses of splice variant discovery and abundance estimation.

Results: In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show that (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) applying DeepSplice to the putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq samples reduces 43 million candidates to around 3 million high-confidence novel splice junctions.

Conclusions: We have implemented a model, inferred from the sequences of annotated exon junctions, that classifies splice junctions derived from primary RNA-seq data. Comprehensive benchmarking and testing indicate that the model performs reliably and is broadly usable for classifying novel splice junctions derived from RNA-seq alignment.

Keywords: Deep learning; Exon splicing; RNA-seq; Splice junction.

Conflict of interest statement

Ethics approval and consent to participate

No permission was required from the ethics committee as the project did not involve testing of human, animal or endangered plant species subjects.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
The ROC curves of DeepSplice, a multilayer perceptron network (MLP) and a long short-term memory network (LSTM) for (a) donor splice site and (b) acceptor splice site classification on the HS3D data set by 10-fold cross-validation. DeepSplice, which uses a convolutional neural network, outperforms the other deep learning architectures, achieving an auROC score of 0.983 (0.974) on donor (acceptor) splice site classification.
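
For readers unfamiliar with the auROC score reported above, the following is a minimal sketch of how such a score can be computed from classifier outputs. It uses the rank-sum identity: auROC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The function name `au_roc` and the toy labels and scores are illustrative assumptions, not taken from the paper.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth classes (1 = true splice site).
    scores: iterable of classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 8 of the 9 positive/negative pairs are ordered correctly.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
print(au_roc(labels, scores))
```

The quadratic pairwise loop is chosen for clarity; on data sets of realistic size a sort-based rank computation would be preferable.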

**Fig. 2**
The ROC curves of DeepSplice Splice Junction Mode and Donor+Acceptor Site Mode for splice junction classification on the GENCODE data set. DeepSplice Splice Junction Mode achieves the higher auROC score, 0.989.

**Fig. 3**
Visualization of the contribution of nucleotides in the flanking splice sequences to the final decision function of DeepSplice on the HS3D dataset for (a) donor splice site and (b) acceptor splice site classification. For both donor and acceptor site classifiers, intronic bases close to the GT-AG di-nucleotides carry the greatest importance. In general, intron sequences carry more discriminative information than exon sequences.

**Fig. 4**
Visualization of the contribution of nucleotides in the flanking splice sequences to the final decision function of DeepSplice on the GENCODE dataset for splice junction classification. The nucleotides in the proximity of a splice junction have the highest impact on the classification outcome. As observed in the splice site classifiers, the distribution of nucleotide contributions in the flanking splice sequences indicates that intron nucleotides carry more discriminative information than exon nucleotides.

**Fig. 5**
Positive splice junctions tend to have high read support and contain the canonical flanking string. (a) Discrete proportions of negatives, positive semi-canonical splice junctions and positive canonical splice junctions from the classification results, given the average read support per sample. Splice junctions with an average read support per sample of more than 15 achieve a positive rate of around 88%. In contrast, only 36% of splice junctions with an average read support per sample of no more than 1 are identified as positive. The probability of obtaining a positive splice junction rises significantly as the average read support per sample increases. Around 99% of positive splice junctions contain the canonical flanking string. (b) Cumulative proportions of positive semi-canonical and canonical splice junctions as the average read support per sample increases.
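
The canonical versus semi-canonical distinction above depends on the di-nucleotides at the two ends of the intron. The caption does not define the categories, so the sketch below assumes a commonly used convention: GT-AG is canonical, GC-AG and AT-AC are semi-canonical, and anything else is non-canonical. The function name `flanking_class` and the example sequences are hypothetical.

```python
# Assumed convention (not stated in the caption): intron start/end di-nucleotides.
CANONICAL = {("GT", "AG")}
SEMI_CANONICAL = {("GC", "AG"), ("AT", "AC")}


def flanking_class(intron_seq):
    """Classify an intron by its first two and last two bases."""
    s = intron_seq.upper()
    if len(s) < 4:
        raise ValueError("intron sequence must contain at least 4 bases")
    motif = (s[:2], s[-2:])
    if motif in CANONICAL:
        return "canonical"
    if motif in SEMI_CANONICAL:
        return "semi-canonical"
    return "non-canonical"


# Toy intron sequences, for illustration only.
for seq in ["GTAAGTCCCAG", "gcaagtttag", "ATATCCAC", "CTAAAC"]:
    print(seq, flanking_class(seq))
```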

**Fig. 6**
Positive splice junctions tend to have both donor and acceptor sites annotated. (a) Discrete proportions of negatives, positive splice junctions without an annotated site, positive splice junctions with the acceptor site annotated, positive splice junctions with the donor site annotated and positive splice junctions with both sites annotated, given the average read support per sample. 97% of splice junctions with both sites annotated are classified as positive, while only 39% of those with both sites novel are positive. Splice junctions connecting annotated splice sites also tend to be associated with higher read coverage. (b) Cumulative proportions of positive splice junctions in each category as the average read support per sample increases.

**Fig. 7**
Splice sites that maintain the coding frame of the exon are observed more often than those that disrupt it. Positive splice junctions in intropolis near known protein-coding junctions show a periodic pattern. For each donor (acceptor) site in the positive splice junctions, we calculated its distance to the nearest annotated donor (acceptor) site, and then counted the frequency of each distance. The red points denote positions that are a multiple of three base pairs from the major splice form, and the black points those that are not.
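
The distance calculation described in this caption can be sketched as follows, under stated assumptions: splice sites are integer genomic coordinates on a single chromosome and strand, the annotated sites are supplied as a sorted non-empty list, and ties between two equally near annotated sites resolve to the lower coordinate. The function names and coordinates are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def nearest_annotated_offset(site, annotated_sorted):
    """Signed distance from `site` to the nearest annotated site.

    `annotated_sorted` must be a non-empty, ascending list of coordinates.
    """
    i = bisect_left(annotated_sorted, site)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = annotated_sorted[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda a: abs(site - a))
    return site - nearest


def preserves_frame(offset):
    """True when the offset is a multiple of three bases."""
    return offset % 3 == 0


# Toy coordinates, for illustration only.
annotated_donors = [100, 250, 400]
novel_donors = [103, 246, 400, 95]

offsets = [nearest_annotated_offset(s, annotated_donors) for s in novel_donors]
frequency = Counter(offsets)  # how often each distance occurs
print(offsets)
print([preserves_frame(o) for o in offsets])
```

Plotting `frequency` with in-frame offsets highlighted would give a view comparable in spirit to the figure, though the exact binning used by the authors is not specified here.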

**Fig. 8**
Visualization of the splice junction sequence representation and the deep convolutional neural network in DeepSplice. During pre-processing, each sequence is converted into a tensor through one-hot encoding. The tensor is fed as input to the deep convolutional neural network, which contains one input layer, two convolutional layers, one fully connected layer (FCN) and one output layer. The convolutional neural network transforms the nucleotide signal in splice junction sequences into the final class label.
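
The one-hot encoding step mentioned in this caption can be illustrated with a minimal sketch. The channel order A, C, G, T and the all-zero row for ambiguous bases such as N are assumptions made for this example; the caption does not say how DeepSplice orders channels or handles ambiguous bases.

```python
ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element 0/1 rows.

    Bases outside ALPHABET (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


# A sequence of length L becomes an L x 4 matrix.
for row in one_hot("ACGTN"):
    print(row)
```

A matrix of this shape is what a convolutional layer would scan along the sequence axis, treating the four columns as input channels.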

**Fig. 9**
Visualization of deep Taylor decomposition in DeepSplice. Deep Taylor decomposition explains the contribution of each nucleotide in the splice junction sequence to the final decision function of the deep convolutional neural network. It operates by running a backward pass on the trained convolutional neural network using a predefined set of rules.

**Fig. 10**
Illustration of the splice junction filtering strategy. In this example, two edit distances are calculated. One ($E_d$) is between the anchor sequence at the donor site ($G[J_d - A_d + 1 : J_d]$) and the intermediate flanking sequence next to the acceptor site ($G[J_a - A_a : J_a - 1]$). The other ($E_a$) is between the anchor sequence at the acceptor site ($G[J_a : J_a + A_a - 1]$) and the intermediate flanking sequence next to the donor site ($G[J_d + 1 : J_d + A_d]$).
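
The two edit distances in this caption can be sketched as below. Several points are assumptions of this example rather than facts from the source: coordinates are treated as 1-based and inclusive, $J_d$ is taken to be the last exonic base before the intron and $J_a$ the first exonic base after it (as the index ranges suggest), and the edit distance is the standard Levenshtein distance. How the resulting distances are thresholded to filter junctions is not stated in the caption and is not modelled here. All function names and the toy genome are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


def region(genome, start, end):
    """Return G[start:end] with 1-based inclusive coordinates (assumed convention)."""
    return genome[start - 1:end]


def junction_edit_distances(genome, j_d, j_a, a_d, a_a):
    """Compute (E_d, E_a) for a junction with donor J_d, acceptor J_a and anchors A_d, A_a."""
    e_d = edit_distance(region(genome, j_d - a_d + 1, j_d),
                        region(genome, j_a - a_a, j_a - 1))
    e_a = edit_distance(region(genome, j_a, j_a + a_a - 1),
                        region(genome, j_d + 1, j_d + a_d))
    return e_d, e_a


# Toy genome: exon (1-8), intron (9-20), exon (21-28). The intron ends with the
# same four bases as the upstream exon, so the donor-side anchor is ambiguous.
genome = "CCCCACGT" + "GTTTTTTTACGT" + "GGGGCCCC"
print(junction_edit_distances(genome, j_d=8, j_a=21, a_d=4, a_a=4))
```

In the toy case the donor anchor could equally well be placed at the end of the intron, which is the kind of ambiguity a small $E_d$ would flag.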
