Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study

Prabina Kumar Meher¹, Subhrajit Satpathy¹

PMID: 34790508
PMCID: PMC8558126
DOI: 10.1007/s13205-021-03036-8

3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.

Affiliation

¹ ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India.

Abstract

Identifying splice sites is an important step in predicting gene structure. Most existing splice site prediction studies have successfully combined machine learning algorithms with sequence-derived features. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. In this study, we therefore evaluated the effect of structural features on splice site prediction accuracy in Arabidopsis thaliana. Prediction accuracies were evaluated with the sequence-derived features alone and with the structural features incorporated into the sequence-derived features, using a support vector machine (SVM) as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered. After the secondary structure features were incorporated, accuracies improved only for the longer sequence dataset, and the improvement was larger for sequence-derived features that account for nucleotide dependencies. For the short sequence dataset, little or no improvement in accuracy was found. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were on par with one another and higher than those of RF and LogitBoost. When prediction was performed with all the sequence-derived features together with the structural features, accuracies improved slightly compared with combining individual sequence-based features with the structural features. To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporate secondary structure information into sequence-derived features. All source code is available at https://github.com/meher861982/SSFeature.
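As a rough illustration of the feature combination described in the abstract, the following minimal Python sketch concatenates sequence-derived encodings with a structural summary into one feature vector. Everything here is an assumption made for illustration and is not taken from the study's code: the one-hot and dinucleotide-frequency encodings stand in for the sequence-derived features, the dot-bracket string stands in for a predicted secondary structure (which in practice would come from an RNA folding tool), and the function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def one_hot(seq):
    """Position-wise sparse encoding: four binary values per nucleotide."""
    return [1.0 if base == n else 0.0 for base in seq.upper() for n in NUCLEOTIDES]


def dinucleotide_freq(seq):
    """Frequencies of adjacent nucleotide pairs, a simple dependency-aware feature."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = max(len(pairs), 1)
    return [pairs.count(d) / total for d in DINUCLEOTIDES]


def structure_features(dot_bracket):
    """Hypothetical structural summary of a dot-bracket string:
    fraction of paired positions and relative length of the longest unpaired run."""
    n = max(len(dot_bracket), 1)
    paired = sum(ch in "()" for ch in dot_bracket)
    unpaired_runs = dot_bracket.replace("(", " ").replace(")", " ").split()
    longest_run = max((len(run) for run in unpaired_runs), default=0)
    return [paired / n, longest_run / n]


def combined_features(seq, dot_bracket):
    """Concatenate sequence-derived and structural features into one vector."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return one_hot(seq) + dinucleotide_freq(seq) + structure_features(dot_bracket)


# Toy window; a real input would be a 40 bp or 105 bp window around a candidate site.
vector = combined_features("ACGTAG", "((..))")
```

A vector built this way could be passed to any standard classifier, for example an SVM; comparing a model trained on the sequence-derived part alone against one trained on the full vector mirrors the kind of comparison the abstract reports.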

Supplementary information: The online version contains supplementary material available at 10.1007/s13205-021-03036-8.

Keywords: Computational biology; Machine learning; Nucleotide dependencies; Secondary structure; Splice junction.

Conflict of interest statement

The author declares that there is no conflict of interest.

Figures

**Fig. 1**
Flow diagram showing the steps involved in the present approach for splice site prediction using machine learning algorithms

**Fig. 2**
Bar plots depicting the estimates of performance metrics for prediction with exonic and intronic false sites. Performance metrics were computed by considering false splice site sequences collected from both exonic and intronic regions. Here, the longer sequence dataset (105 bp) was used for prediction. Performance was slightly higher with the exonic false sites across all performance metrics and encoding schemes

**Fig. 3**
ROC and PR curves of different machine learning algorithms for prediction with sequence-derived and structural features. The performances were evaluated using the longer sequence dataset, where the false splice site sequences from the exonic regions were used. The performances of SVM, AdaBoost, and XGBoost were similar and better than those of the RF and LogitBoost learning methods

**Fig. 4**
ROC and PR curves along with the estimates of auROC and auPRC for different machine learning methods. Performance metrics were computed using the longer sequence dataset, where exonic false sites were used. Performances were evaluated with all the sequence-derived features along with structural features. The prediction accuracies were slightly higher than those obtained by combining individual sequence-based features with structural features. Also, the performances of SVM, AdaBoost, and XGBoost were on par with one another and better than those of RF and LogitBoost

**Fig. 5**
Graphical representation of information gain for all the sequence-based and structural features. Information gain was computed for the negative datasets from the exonic and intronic regions. It can be seen that for the sequence-derived features, information gains are higher for the features generated around splicing junctions
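
The information gain referred to in the Fig. 5 caption can be sketched as follows. This is a minimal illustration that assumes discrete feature values and class labels; how the study handled continuous features is not stated here, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: the nucleotide at one position versus true (1) / false (0) site labels.
gain = information_gain(["G", "G", "A", "C"], [1, 1, 0, 0])
```

A feature whose values separate true from false sites perfectly, as in the toy data, recovers the full label entropy, while a feature unrelated to the labels yields a gain of zero.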
