BMC Bioinformatics. 2019 Dec 27;20(Suppl 23):652.

doi: 10.1186/s12859-019-3306-3.

SpliceFinder: ab initio prediction of splice sites using convolutional neural network

Ruohan Wang¹, Zishuai Wang¹, Jianping Wang², Shuaicheng Li³

Affiliations

¹ Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.
² Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China. jianwang@cityu.edu.hk.
³ Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China. shuaicli@cityu.edu.hk.

PMID: 31881982
PMCID: PMC6933889
DOI: 10.1186/s12859-019-3306-3

Ruohan Wang et al. BMC Bioinformatics. 2019.


Abstract

Background: Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur at splice sites. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives, but it misses non-canonical splice sites.

Result: We designed SpliceFinder, which is based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we trained the neural network on human genomic data. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than that of existing algorithms. The method also outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.
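The sliding-window scan described above can be pictured roughly as follows. This is a minimal sketch and not the authors' implementation: the function name `scan_sequence`, the default window length, the threshold, and the toy `gt_scorer` are all assumptions made for illustration. In SpliceFinder the per-window score would come from the trained CNN.

```python
from typing import Callable, List, Tuple


def scan_sequence(
    sequence: str,
    score_window: Callable[[str], float],
    window: int = 400,
    threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Slide a fixed-length window along `sequence`, one nucleotide at a time.

    Returns (position, score) pairs for every window whose score reaches
    `threshold`. `position` is the 0-based index of the first nucleotide
    to the right of the window centre.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        score = score_window(sequence[start:start + window])
        if score >= threshold:
            hits.append((start + window // 2, score))
    return hits


def gt_scorer(window_seq: str) -> float:
    """Toy stand-in for a trained classifier (assumption, not the CNN).

    Scores 1.0 when the dinucleotide right of the window centre is GT.
    """
    mid = len(window_seq) // 2
    return 1.0 if window_seq[mid:mid + 2] == "GT" else 0.0


demo = "AAAAGTAAAAAAGTAAAA"
hits = scan_sequence(demo, gt_scorer, window=6, threshold=0.5)
print(hits)  # [(4, 1.0), (12, 1.0)]
```

A scorer that only looks for GT reproduces exactly the false-positive problem the abstract describes; it is used here only to keep the example self-contained.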

Conclusion: Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.

Keywords: Canonical and non-canonical splice sites; Convolutional neural network; Splice site prediction.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
The architecture of our proposed CNN. The input of the neural network is the encoded DNA sequence of length L. The first layer is a 1-D convolutional layer consisting of 50 kernels of size 9. The second layer is a fully connected layer with 100 neurons, followed by a dropout layer. Another fully connected layer and a softmax activation function are applied for the final prediction
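As a rough worked example of the layer sizes implied by this caption, the sketch below counts trainable parameters for L = 400 (the upper end of the range in Fig. 3). Several details are assumptions, since the caption does not state them: a 4-channel one-hot encoding, a convolution with stride 1 and no padding, no pooling before the first fully connected layer, and three output classes (donor, acceptor, non-splice-site). The names `one_hot_encode` and `layer_sizes` are hypothetical.

```python
from typing import Dict, List

# Assumed encoding: one channel per nucleotide; anything else maps to zeros.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a DNA string as a len(seq) x 4 matrix."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def layer_sizes(
    seq_len: int,
    channels: int = 4,
    kernels: int = 50,
    kernel_size: int = 9,
    hidden: int = 100,
    classes: int = 3,
) -> Dict[str, int]:
    """Output length and trainable parameter counts under the stated assumptions."""
    conv_out_len = seq_len - kernel_size + 1               # stride 1, no padding
    conv_params = kernels * (kernel_size * channels + 1)   # weights + one bias per kernel
    flat = conv_out_len * kernels                          # flattened conv output
    dense1_params = hidden * (flat + 1)                    # dropout adds no parameters
    dense2_params = classes * (hidden + 1)
    return {
        "conv_output_length": conv_out_len,
        "conv_params": conv_params,
        "dense1_params": dense1_params,
        "dense2_params": dense2_params,
        "total_params": conv_params + dense1_params + dense2_params,
    }


sizes = layer_sizes(400)
print(sizes["conv_output_length"], sizes["total_params"])  # 392 1962253
```

Under these assumptions almost all parameters sit in the first fully connected layer, which is why the choice of sequence length L matters for model size.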

**Fig. 2**
The iterative approach for negative set reconstruction. At each iteration, the trained CNN is tested on a randomly chosen genomic sequence; the false positives are collected and added to the training data, which is used to train the CNN at the next iteration
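The loop in this caption can be sketched generically as below. This is an illustration under stated assumptions, not the published code: `reconstruct_training_set`, `train_fn`, and `predict_fn` are hypothetical names, label 0 is assumed to mean "non-splice-site", and the toy memorising classifier only stands in for the CNN so that the example runs on its own.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (window, label); label 0 is assumed to mean non-splice-site


def reconstruct_training_set(
    train_set: Sequence[Example],
    genomic_sequences: Sequence[Sequence[Example]],
    train_fn: Callable[[List[Example]], object],
    predict_fn: Callable[[object, str], int],
    iterations: int,
    seed: int = 0,
) -> Tuple[object, List[Example]]:
    """Repeatedly test on a random genomic sequence and add its false positives."""
    rng = random.Random(seed)
    data = list(train_set)
    model = train_fn(data)
    for _ in range(iterations):
        labelled_windows = rng.choice(genomic_sequences)
        # A false positive is a true negative that the model calls a splice site.
        false_positives = [
            (window, 0)
            for window, label in labelled_windows
            if label == 0 and predict_fn(model, window) != 0
        ]
        data.extend(false_positives)
        model = train_fn(data)  # retrain on the enlarged training set
    return model, data


# Toy stand-ins (assumptions): the "model" is just the set of known negatives.
def toy_train(data: List[Example]) -> set:
    return {window for window, label in data if label == 0}


def toy_predict(known_negatives: set, window: str) -> int:
    if window in known_negatives:
        return 0
    return 1 if "GT" in window else 0


initial = [("AGTA", 1), ("AAAA", 0)]
genome = [[("CGTC", 0), ("AGTA", 1), ("CCCC", 0)]]
model, data = reconstruct_training_set(initial, genome, toy_train, toy_predict, iterations=2)
print(data)  # [('AGTA', 1), ('AAAA', 0), ('CGTC', 0)]
```

In the toy run, the pseudo site `CGTC` is misclassified once, added as a negative, and no longer reported after retraining.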

**Fig. 3**
The effect of sequence length on accuracy. Sequence lengths are varied from 40 to 400 nt, and the classification accuracies on the test sets of the initial dataset and the reconstructed dataset are compared

**Fig. 4**
The sequence logos and average weighted contribution scores of nucleotides near the splice site. For *donor* sites, *acceptor* sites, and non-splice-sites with canonical signals, the average weighted contribution scores of different models for each nucleotide near the splice site (located between positions 200 and 201) are shown. From left to right, the models are generated from the 1st, 50th, and 100th iteration. The sequence logos [32] show the difference in patterns between true and false splice sites. a Donor. b Non-splice-site with GT dimers. c Acceptor. d Non-splice-site with AG dimers

**Fig. 5**
Comparison of classification performance of different methods on the test set of the reconstructed dataset. The compared measures include (a) classification accuracy; (b) ROC curve for *donor* sites (left) and *acceptor* sites (right); (c) Precision-recall curve for *donor* sites (left) and *acceptor* sites (right)

**Fig. 6**
The prediction performance improves after dataset reconstruction. a False positive numbers for both *donor* sites and *acceptor* sites when the models generated in the iterative process are used to predict the splice sites on three randomly chosen genomic sequences. The false positive numbers of the initial model are set to 100%. b Comparison of accuracy, recall, and false positive numbers between models with and without dataset reconstruction

**Fig. 7**
Comparison of recall of different software tools for *donor* sites of Genomic Sequence III. The recall values of the four tools for *donor* sites of Genomic Sequence III are calculated using different score cutoffs or models generated in the iterative process

**Fig. 8**
The splice site prediction accuracy of our models for other species. For (a) *Drosophila melanogaster*, (b) *Mus musculus*, (c) *Rattus*, and (d) *Danio rerio*, the models generated in the iterative process are applied to predict the splice sites on three randomly chosen genomic sequences

**Fig. 9**
The false positive numbers and recall of our models for other species. For (a) *Drosophila melanogaster*, (b) *Mus musculus*, (c) *Rattus*, and (d) *Danio rerio*, the numbers of false positives and the recall values are calculated to show the prediction performance for other species in more detail
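For reference, false positive counts and recall of the kind reported in these figures follow from comparing predicted positions with annotated ones. The sketch below uses the standard definitions of precision, recall, and F1; the function name `site_metrics` and the example positions are hypothetical and are not data from the paper.

```python
from typing import Dict, Iterable


def site_metrics(predicted: Iterable[int], annotated: Iterable[int]) -> Dict[str, float]:
    """Compare predicted splice-site positions with annotated positions."""
    predicted_set, annotated_set = set(predicted), set(annotated)
    tp = len(predicted_set & annotated_set)   # predicted and annotated
    fp = len(predicted_set - annotated_set)   # predicted but not annotated
    fn = len(annotated_set - predicted_set)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "false_positives": fp,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up positions, for illustration only.
metrics = site_metrics(predicted=[10, 20, 30, 40], annotated=[10, 20, 50, 60, 70])
print(metrics["false_positives"], metrics["precision"], metrics["recall"])  # 2 0.5 0.4
```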


Cited by

Principles and Practical Considerations for the Analysis of Disease-Associated Alternative Splicing Events Using the Gateway Cloning-Based Minigene Vectors pDESTsplice and pSpliceExpress.
Putscher E, Hecker M, Fitzner B, Lorenz P, Zettl UK. Int J Mol Sci. 2021 May 13;22(10):5154. doi: 10.3390/ijms22105154. PMID: 34068052. Review.
Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study.
Meher PK, Satpathy S. 3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31. PMID: 34790508.
Splam: a deep-learning-based splice site predictor that improves spliced alignments.
Chao KH, Mao A, Salzberg SL, Pertea M. Genome Biol. 2024 Sep 16;25(1):243. doi: 10.1186/s13059-024-03379-4. PMID: 39285451.
Splice Junction Identification using Long Short-Term Memory Neural Networks.
Regan K, Saghafi A, Li Z. Curr Genomics. 2021 Dec 30;22(5):384-390. doi: 10.2174/1389202922666211011143008. PMID: 35283668.
Applications for Deep Learning in Epilepsy Genetic Research.
Zeibich R, Kwan P, J O'Brien T, Perucca P, Ge Z, Anderson A. Int J Mol Sci. 2023 Sep 27;24(19):14645. doi: 10.3390/ijms241914645. PMID: 37834093. Review.


References

1. Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller K-R, Sommer R-J, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol. 2007;3(2):20.
2. Reese MG, Eeckman FH, Kulp D, Haussler D. Improved splice site detection in Genie. J Comput Biol. 1997;4(3):311–23.
3. Breathnach R, Benoist C, O'Hare K, Gannon F, Chambon P. Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc Natl Acad Sci. 1978;75(10):4853–7.
4. Mount SM. A catalogue of splice junction sequences. Nucleic Acids Res. 1982;10(2):459–72.
5. Hodge MR, Cumsky MG. Splicing of a yeast intron containing an unusual 5' junction sequence. Mol Cell Biol. 1989;9(6):2765–70.
