BMC Bioinformatics. 2019 Jul 30;20(1):409.

doi: 10.1186/s12859-019-2999-7.

IRESpy: an XGBoost model for prediction of internal ribosome entry sites

Junhui Wang¹, Michael Gribskov²

Affiliations

¹ Biological Sciences Department, Purdue University, West Lafayette, IN, USA.
² Biological Sciences Department, Purdue University, West Lafayette, IN, USA. mgribsko@purdue.edu.

PMID: 31362694
PMCID: PMC6664791
DOI: 10.1186/s12859-019-2999-7

Junhui Wang et al. BMC Bioinformatics. 2019.

Abstract

Background: Internal ribosome entry sites (IRES) are segments of mRNA found in untranslated regions that can recruit the ribosome and initiate translation independently of the 5' cap-dependent translation initiation mechanism. IRES usually function when 5' cap-dependent translation initiation has been blocked or repressed. They have been widely found to play important roles in viral infections and cellular processes. However, only a limited number of confirmed IRES have been reported, because confirmation requires highly labor-intensive, slow, and low-efficiency laboratory experiments. Bioinformatics tools have been developed, but there is no reliable online tool.

Results: This paper systematically examines the features that can distinguish IRES from non-IRES sequences. Sequence features such as kmer words, structural features such as Q_MFE, and sequence/structure hybrid features are evaluated as possible discriminators. They are incorporated into an IRES classifier based on XGBoost. The XGBoost model performs better than previous classifiers, with higher accuracy and much shorter computational time. The number of features in the model has been greatly reduced, compared to previous predictors, by including global kmer and structural features. The contributions of model features are well explained by LIME and SHapley Additive exPlanations. The trained XGBoost model has been implemented as a bioinformatics tool for IRES prediction, IRESpy (https://irespy.shinyapps.io/IRESpy/), which has been applied to scan the human 5' UTR and find novel IRES segments.

Conclusions: IRESpy is a fast, reliable, high-throughput IRES online prediction tool. It provides a publicly available tool for all IRES researchers, and can be used in other genomics applications such as gene annotation and analysis of differential gene expression.

Keywords: Bioinformatics; Internal ribosome entry site (IRES); Machine learning; XGBoost.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Calculation of kmer features. An example of kmer features in the Cricket paralysis virus (CrPV) intergenic region (IGR) is shown, with examples from 1mer to 4mer. The red and green boxes show examples of the observation window used to calculate local kmers. In total, 340 global kmers and 5440 local kmers were tested in this research.
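The count of 340 global kmers matches the number of distinct 1mers through 4mers over a four-letter alphabet (4 + 16 + 64 + 256). The following is a minimal sketch of how such global features could be computed, not the authors' implementation. The function name `global_kmer_features`, the example sequence, and the normalization by the number of windows are assumptions made for illustration; the local (windowed) kmers are not covered because the caption does not give the window parameters.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"


def global_kmer_features(seq, max_k=4):
    """Return a dict mapping every 1..max_k-mer to its frequency in seq."""
    seq = seq.upper().replace("T", "U")
    features = {}
    for k in range(1, max_k + 1):
        n_windows = max(len(seq) - k + 1, 0)
        counts = Counter(seq[i:i + k] for i in range(n_windows))
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            # Assumption: frequency = count / number of k-length windows.
            features[kmer] = counts[kmer] / n_windows if n_windows else 0.0
    return features


# Made-up sequence, used only to exercise the function.
features = global_kmer_features("AAAAAUGCUAGCUUUGA")
print(len(features))  # 340 = 4 + 16 + 64 + 256
```

Emitting every possible kmer, including those with zero count, keeps the feature vector the same length for every sequence, which is what a tree-based classifier expects.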

**Fig. 2**
Q_MFE calculation examples of IRES and non-IRES sequences. (a) PMFE of randomized sequences (density plot) and PMFE of the CrPV IGR IRES (viral IRES, PMFE = -47.5, Q_MFE = 0.001), the ERH 5′ UTR (housekeeping gene, PMFE = -12.7, Q_MFE = 0.99), the Apaf-1 cellular IRES (PMFE = -76, Q_MFE = 0.66), and a CrPV non-IRES region (positions 6200–6399, PMFE = -22.2, Q_MFE = 0.94). (b) Q_MFE of 200-base segments across the whole genomic CrPV mRNA. The Q_MFE shows minimal values in the regions of the known 5′ UTR IRES (bases 1–708) and IGR IRES (bases 6000–6200).
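The values in this caption suggest that Q_MFE locates a sequence's PMFE within the distribution of PMFEs of randomized sequences, with small values meaning the native sequence folds more stably than most randomized ones. The sketch below illustrates that idea under stated assumptions and is not the authors' procedure: `q_mfe` and `toy_energy` are hypothetical names, the rank estimator R / (N + 1) and the simple mononucleotide shuffle are assumptions, and `toy_energy` is a crude stand-in with no thermodynamic meaning. A real analysis would pass in an energy function backed by an RNA folding program.

```python
import random

# Watson-Crick and G-U wobble pairs, used only by the toy energy below.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_energy(seq):
    """Stand-in for a folding energy: minus the number of mirrored positions
    that could base-pair. NOT a real minimum free energy."""
    half = len(seq) // 2
    return -sum((seq[i], seq[-1 - i]) in PAIRS for i in range(half))


def q_mfe(seq, fold_energy, n_random=1000, seed=0):
    """Fraction of shuffled sequences whose energy is <= the native energy.

    Assumption: the estimate is R / (N + 1), where R counts shuffled
    sequences at least as stable as the native one and N = n_random.
    """
    rng = random.Random(seed)
    native = fold_energy(seq)
    bases = list(seq)
    n_as_stable = 0
    for _ in range(n_random):
        rng.shuffle(bases)  # mononucleotide shuffle (an assumption)
        if fold_energy("".join(bases)) <= native:
            n_as_stable += 1
    return n_as_stable / (n_random + 1)


# Made-up hairpin-like sequence: few shuffles should be as "stable".
q = q_mfe("GGGGGGGGAAAACCCCCCCC", toy_energy)
print(q)
```

Passing the energy function as a parameter keeps the ranking logic independent of whichever folding tool supplies the energies.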

**Fig. 3**
Calculation of triplet features. An example of triplet features in the Cricket paralysis virus (CrPV) intergenic region (IGR) is shown. The secondary structure of the candidate sequence was predicted using UNAfold [29]. For each nucleotide, only two states are possible: paired or unpaired. Parentheses “()” and dots “.” represent the paired and unpaired nucleotides in the predicted secondary structure, respectively. For any 3 adjacent bases, there are 8 possible structural states: “(((”, “((.”, “(..”, “(.(”, “.((”, “.(.”, “..(”, and “...”. Triplet features comprise the structural state plus the identity of the central base (A, C, G, or U), so there are 32 (8 × 4 = 32) triplet features in total. Triplet features are normalized by dividing the observed number of each triplet by the total number of all triplet features.
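The triplet encoding described in this caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `triplet_features`, the feature key format (central base followed by the three-character state, for example `G(((`), and the example sequence and structure are all assumptions. The structure is supplied as a dot-bracket string, which in practice would come from a structure prediction program.

```python
from itertools import product

# 8 structural states over {paired "(", unpaired "."} for 3 adjacent bases.
STATES = ["".join(p) for p in product("(.", repeat=3)]
# 32 features: central base identity + structural state (key format assumed).
FEATURES = [base + state for state in STATES for base in "ACGU"]


def triplet_features(seq, structure):
    """Return the 32 normalized triplet frequencies for seq and its
    dot-bracket structure."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    # Both bracket directions mean "paired", so collapse ")" into "(".
    states = structure.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + states[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Made-up hairpin: three base pairs closing a three-base loop.
feats = triplet_features("GGGAAACCC", "(((...)))")
print(len(feats))  # 32
```

Because the frequencies are divided by the total triplet count, sequences of different lengths yield comparable feature values.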

**Fig. 4**
Model performance of XGBoost and GBDT. (a) Model performance of XGBoost and GBDT using only the global kmer features, without any hyperparameter tuning. (b) Model performance comparison using the area under the ROC curve (AUC). The XGBoost model has a lower training AUC but a higher testing AUC than the GBDT model. The XGBoost model trained with only global kmers performs the same as the GBDT model, but the number of features is reduced from 5780 to 340.

**Fig. 5**
Effect of incorporating structural features. Q_MFE and triplet features were included in a combined model with global kmer features. We examined models incorporating only global kmer features, only structural features, and a combination of global kmer and structural features.

**Fig. 6**
XGBoost model feature importance explained by SHAP values at the global scale. (a) Summary of SHAP values of the top 20 important features for the model including both global kmers and local kmers. (b) Summary of SHAP values of the top 20 important features for the model including only global kmers. (c) Summary of SHAP values of the top 20 important features for the model including both global kmers and structural features. (d) Summary of SHAP values of the top 20 important features for the model including only structural features.

**Fig. 7**
XGBoost model feature importance explained by SHAP and LIME at a local scale. (a) SHAP (SHapley Additive exPlanations) dependence plots of the importance of the UUU and GA kmers in the XGBoost model. (b) Local Interpretable Model-agnostic Explanations (LIME) for the CrPV IGR IRES and a CrPV protein coding sequence. Green bars show the weighted features that support classification as IRES, and red bars show the weighted features that oppose classification as IRES.

**Fig. 8**
Correlation between IRESpy prediction and experimental results

**Fig. 9**
The density distribution of predicted IRES probability in Dataset 2 and human UTR scan

**Fig. 10**
Predicted probability of IRES for highly structured RNA families, and IRES and non-IRES classes in Datasets 1 and 2

Cited by

Long non-coding RNA-encoded micropeptides: functions, mechanisms and implications.
Xiao Y, Ren Y, Hu W, Paliouras AR, Zhang W, Zhong L, Yang K, Su L, Wang P, Li Y, Ma M, Shi L. Cell Death Discov. 2024 Oct 23;10(1):450. doi: 10.1038/s41420-024-02175-0. PMID: 39443468. Free PMC article. Review.
Development of machine learning model for diagnostic disease prediction based on laboratory tests.
Park DJ, Park MW, Lee H, Kim YJ, Kim Y, Park YH. Sci Rep. 2021 Apr 7;11(1):7567. doi: 10.1038/s41598-021-87171-5. PMID: 33828178. Free PMC article.
RNA-Binding Proteins as Regulators of Internal Initiation of Viral mRNA Translation.
López-Ulloa B, Fuentes Y, Pizarro-Ortega MS, López-Lastra M. Viruses. 2022 Jan 19;14(2):188. doi: 10.3390/v14020188. PMID: 35215780. Free PMC article. Review.
Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning.
Sokhansanj BA, Rosen GL. Comput Biol Med. 2022 Oct;149:105969. doi: 10.1016/j.compbiomed.2022.105969. Epub 2022 Aug 17. PMID: 36041271. Free PMC article.
Parvovirus B19 and Human Parvovirus 4 Encode Similar Proteins in a Reading Frame Overlapping the VP1 Capsid Gene.
Karlin DG. Viruses. 2024 Jan 26;16(2):191. doi: 10.3390/v16020191. PMID: 38399966. Free PMC article.


References

- Bonnet E, Wuyts J, Rouzé P, Van de Peer Y. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics. 2004;20(17):2911–2917. doi: 10.1093/bioinformatics/bth374.
- Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
- Clote P, Ferre F, Kranakis E, Krizanc D. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA. 2005;11(5):578–591. doi: 10.1261/rna.7220505.
- Costantino D, Kieft JS. A preformed compact ribosome-binding domain in the cricket paralysis-like virus IRES RNAs. RNA. 2005;11(3):332–343. doi: 10.1261/rna.7184705.
- Fernandez-Miragall O, Martinez-Salas E. Structural organization of a viral IRES depends on the integrity of the GNRA motif. RNA. 2003;9(11):1333–1344. doi: 10.1261/rna.5950603.
