Gene X. 2020 May 13:5:100035.
doi: 10.1016/j.gene.2020.100035. eCollection 2020 Dec.

Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA

Somayah Albaradei et al. Gene X.

Abstract

Background: The accurate identification of exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict SS is useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, their accuracy must improve before annotation can be considered reliable. Moreover, models are often derived and tested using the same genome, providing no evidence of broad applicability, i.e. to other poorly studied genomes.

Results: With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on their ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. The results demonstrate that the models efficiently detect SS in organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that the models correctly identify conserved genomic elements, enabling annotation of SS in new genomes by choosing the taxonomically closest model.

Conclusions: The results of our study demonstrate that Splice2Deep both achieves a considerably lower error rate than other state-of-the-art models and accurately recognizes SS in organisms on which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes. Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.

Keywords: AUC, area under curve; AcSS, acceptor splice site; Acc, accuracy; Bioinformatics; CNN, convolutional neural network; CONV, convolutional layers; DL, deep learning; DNA, deoxyribonucleic acid; DT, decision trees; Deep-learning; DoSS, donor splice site; FC, fully connected layer; ML, machine learning; NB, naive Bayes; NN, neural network; POOL, pooling layer; Prediction; RF, random forest; RNA, ribonucleic acid; ReLU, rectified linear unit layer; SS, splice site; SVM, support vector machine; Sn, sensitivity; Sp, specificity; Splice sites; Splicing.

Conflict of interest statement

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests.

Figures

Fig. 1
Accuracy results obtained from the cross-organism model validation. A–E) Cross-organism validation results for the prediction of AcSS, F–J) cross-organism validation results for the prediction of DoSS.
Fig. 2
Data representation. A) Mononucleotide embedding with length (4 × L), and B) trinucleotide embedding with length (64 × L).
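The two embeddings named in this caption can be illustrated with a minimal Python sketch. It assumes plain one-hot encoding over the alphabet A, C, G, T and overlapping trinucleotides; the function names are hypothetical and are not taken from the Splice2Deep repository. With overlapping trinucleotides and no padding, the second matrix has L − 2 columns rather than the L stated in the caption, so how the authors treat the sequence ends is not reproduced here.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 64 trinucleotides in lexicographic order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def one_hot_mono(seq):
    """Return a 4 x L matrix (list of rows); row i marks positions of NUCLEOTIDES[i]."""
    seq = seq.upper()
    return [[1 if base == n else 0 for base in seq] for n in NUCLEOTIDES]


def one_hot_tri(seq):
    """Return a 64 x (L - 2) matrix over overlapping trinucleotides (assumption: no padding)."""
    seq = seq.upper()
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return [[1 if kmer == t else 0 for kmer in kmers] for t in TRINUCLEOTIDES]


mono = one_hot_mono("ACGTAG")
tri = one_hot_tri("ACGTAG")
print(len(mono), len(mono[0]))  # rows, columns of the mononucleotide matrix
print(len(tri), len(tri[0]))    # rows, columns of the trinucleotide matrix
```

Characters outside A, C, G, T (for example N) simply produce an all-zero column in this sketch.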
Fig. 3
Splice2Deep model overview. Local and surrounding windows. ‘SS’ refers to splice site and ‘N’ to nucleotides.
Fig. 4
Splice2Deep learning model. It takes a DNA sequence as input, embedded in 2D (either 4 × L or 64 × L), and applies k motif detectors (filters), max pooling, and flattening, followed by a fully connected layer that uses SoftMax to output scores.
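
The sequence of operations in this caption can be traced with a small NumPy forward pass. This is only a sketch with random, untrained weights, not the authors' Keras model: the filter count, filter width, pooling size, the ReLU after the convolution, and the two output classes are all assumptions chosen for illustration, and `forward` is a hypothetical name.

```python
import numpy as np


def forward(x, filters, dense_w, dense_b, pool=2):
    """Convolution -> ReLU -> max pooling -> flatten -> fully connected softmax.

    x:        (4, L) one-hot encoded sequence
    filters:  (k, 4, w) motif detectors
    dense_w:  (n_flat, n_classes) fully connected weights
    dense_b:  (n_classes,) fully connected bias
    """
    k, _, w = filters.shape
    n_pos = x.shape[1] - w + 1
    # Slide every motif detector along the sequence (valid convolution).
    conv = np.empty((k, n_pos))
    for i in range(n_pos):
        conv[:, i] = np.tensordot(filters, x[:, i:i + w], axes=([1, 2], [0, 1]))
    conv = np.maximum(conv, 0.0)  # ReLU (assumed activation)
    # Non-overlapping max pooling along the sequence axis.
    n_pool = n_pos // pool
    pooled = conv[:, :n_pool * pool].reshape(k, n_pool, pool).max(axis=2)
    flat = pooled.reshape(-1)
    logits = flat @ dense_w + dense_b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


# Hypothetical sizes: sequence length 20, 8 filters of width 5, 2 classes.
rng = np.random.default_rng(0)
L, k, w = 20, 8, 5
x = np.eye(4)[rng.integers(0, 4, size=L)].T      # random one-hot input, shape (4, L)
filters = rng.normal(size=(k, 4, w))
n_flat = k * ((L - w + 1) // 2)
scores = forward(x, filters, rng.normal(size=(n_flat, 2)), np.zeros(2))
print(scores.shape, scores.sum())
```

The output is a pair of class scores that sum to one; with the trinucleotide embedding, the same function would take a (64, L) input and filters of shape (k, 64, w).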