BMC Bioinformatics. 2019 Dec 27;20(Suppl 23):652.
doi: 10.1186/s12859-019-3306-3.

SpliceFinder: ab initio prediction of splice sites using convolutional neural network

Ruohan Wang et al. BMC Bioinformatics.

Abstract

Background: Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur there. However, these dinucleotides are also frequent in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. This approach reduces false positives, but it misses non-canonical splice sites.
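The candidate-selection step described above can be illustrated with a minimal sketch. The function name `find_dimer_positions` and the toy sequence are hypothetical and are not taken from any of the tools discussed; the sketch only shows why filtering on GT and AG yields many candidates and cannot, by construction, report a site that lacks those dimers.

```python
def find_dimer_positions(sequence, dimer):
    """Return the 0-based start index of every occurrence of `dimer`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(dimer)
    while start != -1:
        positions.append(start)
        # Step forward by one base so that overlapping matches are not skipped.
        start = sequence.find(dimer, start + 1)
    return positions


# Toy sequence (made up for illustration, not real genomic data).
toy_sequence = "AAGGTAAGTCCAGGT"
donor_candidates = find_dimer_positions(toy_sequence, "GT")
acceptor_candidates = find_dimer_positions(toy_sequence, "AG")
print(donor_candidates, acceptor_candidates)
```

Even in this 15-base toy sequence each dimer occurs three times, so a classifier that starts from such candidates must reject most of them, and a non-canonical site would never enter the candidate list at all.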

Result: We designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is used to reconstruct the dataset, which addresses the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. It also outperforms existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites in long genomic sequences using a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, it performs well on genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.

Conclusion: We have proposed SpliceFinder, a new CNN-based ab initio splice site prediction tool that generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.

Keywords: Canonical and non-canonical splice sites; Convolutional neural network; Splice site prediction.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The architecture of our proposed CNN. The input of the neural network is the encoded DNA sequence of length L. The first layer is a 1-D convolutional layer consisting of 50 kernels of size 9. The second layer is a fully connected layer with 100 neurons, followed by a dropout layer. Another fully connected layer with a softmax activation function produces the final prediction.
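To give a sense of the size of the network in this caption, the following sketch counts trainable parameters layer by layer. Only the 50 kernels of size 9 and the 100-neuron hidden layer come from the caption. The 4 input channels (one-hot encoding), stride 1 without padding, the absence of pooling, the 3 output classes, and L = 400 are assumptions made for this worked example, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(seq_length, channels=4, kernels=50, kernel_size=9,
                     hidden=100, classes=3):
    """Count weights and biases for conv -> flatten -> dense -> dense."""
    # Each kernel spans kernel_size positions across all input channels, plus a bias.
    conv = kernels * (kernel_size * channels) + kernels
    # Assumed: stride 1 and no padding, so the output is shortened by kernel_size - 1.
    conv_out_length = seq_length - kernel_size + 1
    flattened = conv_out_length * kernels
    dense1 = flattened * hidden + hidden
    # Dropout has no trainable parameters, so it does not appear here.
    dense2 = hidden * classes + classes
    return {
        "conv": conv,
        "dense1": dense1,
        "dense2": dense2,
        "total": conv + dense1 + dense2,
    }


print(count_parameters(400))
```

Under these assumptions nearly all parameters sit in the first fully connected layer, because the flattened convolution output grows linearly with L.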
Fig. 2
The iterative approach for negative set reconstruction. At each iteration, the trained CNN is tested on a randomly chosen genomic sequence; the false positives are collected and added to the training data, which is used to train the CNN at the next iteration.
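The loop in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `train` and `predict_sites` are hypothetical callables supplied by the caller, negatives are stored as (sequence name, position) pairs rather than sequence windows, and the toy "model" merely memorises rejected positions.

```python
import random


def reconstruct_negative_set(positives, negatives, genomic_sequences, true_sites,
                             train, predict_sites, iterations, seed=0):
    """Retrain repeatedly, adding each round's false positives to the negatives."""
    rng = random.Random(seed)
    negatives = list(negatives)
    for _ in range(iterations):
        model = train(positives, negatives)
        # Test on one randomly chosen genomic sequence per iteration.
        name = rng.choice(sorted(genomic_sequences))
        predicted = predict_sites(model, name, genomic_sequences[name])
        for position in predicted:
            example = (name, position)
            if position not in true_sites[name] and example not in negatives:
                negatives.append(example)  # false positive becomes a new negative
    return train(positives, negatives), negatives


# Toy stand-ins (hypothetical): the model is the set of memorised negatives,
# and every GT that has not been memorised is predicted as a donor site.
def toy_train(positives, negatives):
    return frozenset(negatives)


def toy_predict(model, name, sequence):
    return [i for i in range(len(sequence) - 1)
            if sequence[i:i + 2] == "GT" and (name, i) not in model]


toy_sequences = {"toy": "AAGGTAAGTCCAGGT"}
toy_true_sites = {"toy": {3}}
final_model, final_negatives = reconstruct_negative_set(
    positives=[("toy", 3)], negatives=[], genomic_sequences=toy_sequences,
    true_sites=toy_true_sites, train=toy_train, predict_sites=toy_predict,
    iterations=2)
print(final_negatives)
```

In the toy run the first iteration collects the two pseudo donor sites as negatives, and the retrained model then predicts only the true site.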
Fig. 3
The effect of sequence length on accuracy. With sequence lengths varied from 40 to 400 nt, the classification accuracies on the test sets of the initial and reconstructed datasets are compared.
Fig. 4
The sequence logos and average weighted contribution scores of nucleotides near the splice site. For donor sites, acceptor sites, and non-splice-sites with canonical signals, the average weighted contribution scores of different models for each nucleotide near the splice site (located between positions 200 and 201) are shown. From left to right, the models are from the 1st, 50th, and 100th iterations. The sequence logos [32] show the difference in patterns between true and false splice sites. a Donor. b Non-splice-site with GT dimers. c Acceptor. d Non-splice-site with AG dimers.
Fig. 5
Comparison of classification performance of different methods on the test set of the reconstructed dataset. The compared measures include (a) classification accuracy; (b) ROC curve for donor sites (left) and acceptor sites (right); (c) Precision-recall curve for donor sites (left) and acceptor sites (right)
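For reference, the per-class precision, recall, and F1 score used in comparisons of this kind can be computed as in the sketch below. The helper name `precision_recall_f1` and the toy labels are hypothetical, and treating one class as positive against all others is an assumption about how the multi-class output is evaluated.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_class):
    """Compute precision, recall, and F1 for one class against all others."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_class and p == positive_class for t, p in pairs)
    fp = sum(t != positive_class and p == positive_class for t, p in pairs)
    fn = sum(t == positive_class and p != positive_class for t, p in pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels (made up for illustration).
toy_true = ["donor", "donor", "none", "acceptor", "none"]
toy_pred = ["donor", "none", "donor", "acceptor", "none"]
print(precision_recall_f1(toy_true, toy_pred, "donor"))
```

A precision-recall curve is obtained by repeating this calculation while sweeping the score cutoff that turns classifier scores into predicted labels.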
Fig. 6
The prediction performance improves after dataset reconstruction. a The models generated in the iterative process are used to predict the splice sites on three randomly chosen genomic sequences; the numbers of false positives for both donor and acceptor sites are shown, with the false positive count of the initial model set to 100%. b Comparison of accuracy, recall, and number of false positives between models with and without dataset reconstruction.
Fig. 7
Comparison of recall of different tools for donor sites of Genomic Sequence III. The recall values of the four tools for donor sites of Genomic Sequence III are calculated using different score cutoffs or the models generated in the iterative process.
Fig. 8
The splice site prediction accuracy of our models for other species. For (a) Drosophila melanogaster, (b) Mus musculus, (c) Rattus, and (d) Danio rerio, the models generated in the iterative process are applied to predicting the splice sites on three randomly chosen genomic sequences
Fig. 9
The numbers of false positives and recall of our models for other species. For (a) Drosophila melanogaster, (b) Mus musculus, (c) Rattus, and (d) Danio rerio, the number of false positives and the recall are calculated to show the prediction performance for other species in more detail.

References

    1. Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller K-R, Sommer R-J, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol. 2007;3(2):20.
    2. Reese MG, Eeckman FH, Kulp D, Haussler D. Improved splice site detection in Genie. J Comput Biol. 1997;4(3):311–23.
    3. Breathnach R, Benoist C, O’Hare K, Gannon F, Chambon P. Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc Natl Acad Sci. 1978;75(10):4853–7.
    4. Mount SM. A catalogue of splice junction sequences. Nucleic Acids Res. 1982;10(2):459–72.
    5. Hodge MR, Cumsky MG. Splicing of a yeast intron containing an unusual 5’ junction sequence. Mol Cell Biol. 1989;9(6):2765–70.
