BMC Bioinformatics. 2019 Jul 30;20(1):409.
doi: 10.1186/s12859-019-2999-7.

IRESpy: an XGBoost model for prediction of internal ribosome entry sites

Junhui Wang et al. BMC Bioinformatics.

Abstract

Background: Internal ribosome entry sites (IRES) are segments of mRNA found in untranslated regions that can recruit the ribosome and initiate translation independently of the 5' cap-dependent translation initiation mechanism. IRES usually function when 5' cap-dependent translation initiation has been blocked or repressed. They have been widely found to play important roles in viral infections and cellular processes. However, only a limited number of confirmed IRES have been reported, because confirming them requires labor-intensive, slow, and low-efficiency laboratory experiments. Bioinformatics tools have been developed, but there is no reliable online tool.

Results: This paper systematically examines the features that can distinguish IRES from non-IRES sequences. Sequence features such as kmer words, structural features such as QMFE, and sequence/structure hybrid features are evaluated as possible discriminators. They are incorporated into an IRES classifier based on XGBoost. The XGBoost model performs better than previous classifiers, with higher accuracy and much shorter computational time. The number of features in the model has been greatly reduced, compared to previous predictors, by including global kmer and structural features. The contributions of model features are well explained by LIME and SHapley Additive exPlanations. The trained XGBoost model has been implemented as a bioinformatics tool for IRES prediction, IRESpy (https://irespy.shinyapps.io/IRESpy/), which has been applied to scan the human 5' UTR and find novel IRES segments.

Conclusions: IRESpy is a fast, reliable, high-throughput IRES online prediction tool. It provides a publicly available tool for all IRES researchers, and can be used in other genomics applications such as gene annotation and analysis of differential gene expression.

Keywords: Bioinformatics; Internal ribosome entry site (IRES); Machine learning; XGBoost.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Calculation of kmer features. An example of kmer features in the Cricket paralysis virus (CrPV) intergenic region (IGR) is shown, with examples from 1mer to 4mer. The red and green boxes show examples of the observation window used to calculate local kmers. 340 global kmers and 5440 local kmers have been tested in this research
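The global kmer count can be checked with a short sketch. With words of length 1 to 4 over the four RNA bases there are 4 + 16 + 64 + 256 = 340 possible kmers, which matches the 340 global kmers in the caption. The function names and the normalization (dividing by the number of overlapping windows of each length) are assumptions made for illustration, not details taken from the paper. Local kmers are not sketched here because the caption does not give the window parameters.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_vocabulary(max_k=4):
    """Return every word of length 1..max_k over the RNA alphabet, in a fixed order."""
    return [
        "".join(letters)
        for k in range(1, max_k + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def global_kmer_features(seq, max_k=4):
    """Frequency of each 1..max_k-mer counted over the whole sequence.

    Assumption: overlapping windows are used, and each count is divided by
    the number of windows of that length.
    """
    seq = seq.upper().replace("T", "U")
    features = dict.fromkeys(kmer_vocabulary(max_k), 0.0)
    for k in range(1, max_k + 1):
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue  # sequence shorter than k
        for i in range(n_windows):
            word = seq[i:i + k]
            if word in features:  # skip words containing N or other symbols
                features[word] += 1.0 / n_windows
    return features
```

Calling `global_kmer_features` on any RNA sequence returns a dictionary with 340 entries, one per possible word, so every sequence maps to a feature vector of the same length.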
Fig. 2
QMFE calculation examples of IRES and non-IRES sequences. a PMFE of randomized sequences (density plot) and PMFE of the CrPV IGR IRES (viral IRES, PMFE = -47.5, QMFE = 0.001), the ERH 5′ UTR (housekeeping gene, PMFE = -12.7, QMFE = 0.99), Apaf-1 cellular IRES (PMFE = -76, QMFE = 0.66), and CrPV non-IRES regions (position: 6200–6399, PMFE = -22.2, QMFE = 0.94). b QMFE of 200-base segments across the whole genomic CrPV mRNA. The QMFE shows minimal values in the regions of the known 5′ UTR IRES (bases 1–708) and IGR IRES (bases 6000–6200)
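The relationship between PMFE and QMFE in this caption can be sketched as a quantile: the position of the native sequence's predicted minimum free energy within the distribution of energies from randomized sequences. The sketch below assumes the energies have already been computed by a folding program, and it assumes ties are counted as "at or below"; the exact randomization procedure and tie handling are not given in the caption. The function name `qmfe` and the example energies in the usage note are illustrative only.

```python
def qmfe(native_pmfe, randomized_pmfes):
    """Fraction of randomized sequences whose PMFE is at or below the native PMFE.

    A value near 0 means the native sequence folds more stably than almost
    all randomized sequences; a value near 1 means it is no more stable
    than random.
    """
    if not randomized_pmfes:
        raise ValueError("need at least one randomized PMFE")
    at_or_below = sum(1 for energy in randomized_pmfes if energy <= native_pmfe)
    return at_or_below / len(randomized_pmfes)
```

For instance, with made-up randomized energies `[-50.0, -30.0, -20.0, -10.0]`, a native PMFE of -47.5 gives a QMFE of 0.25, since only one of the four randomized sequences is at least as stable.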
Fig. 3
Calculation of triplet features. An example of triplet features in the Cricket paralysis virus (CrPV) intergenic region (IGR) is shown. The secondary structure of the candidate sequence was predicted using UNAfold [29]. For each nucleotide, only two states are possible, paired or unpaired. Parentheses “()” or dots “.” represent the paired and unpaired nucleotides in the predicted secondary structure, respectively. For any 3 adjacent bases, there are 8 possible structural states: “(((”, “((.”, “(..”, “(.(”, “.((”, “.(.”, “..(”, and “...”. Triplet features comprise the structural states plus the identity of the central base, A, C, G, or U, so there are 32 (8*4 = 32) triplet features in total. Triplet features are normalized by dividing the observed number of each triplet by the total number of all the triplet features
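The triplet scheme described in this caption can be sketched as follows, assuming the secondary structure is already available as a dot-bracket string of the same length as the sequence. Both "(" and ")" are collapsed into a single paired state, as the caption specifies. The function name and the feature label format (central base followed by the three-character state, for example `A(..`) are hypothetical choices for this sketch.

```python
from itertools import product

BASES = "ACGU"
# The 8 structural states for 3 adjacent positions: each is paired "(" or unpaired ".".
STATES = ["".join(state) for state in product("(.", repeat=3)]


def triplet_features(seq, structure):
    """Normalized counts of the 32 (central base, structural state) triplets."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Only paired vs. unpaired matters, so closing brackets become "(".
    states = structure.replace(")", "(")
    counts = {base + state: 0 for base in BASES for state in STATES}
    total = 0
    # Slide over every position that has a neighbour on both sides.
    for i in range(1, len(seq) - 1):
        key = seq[i] + states[i - 1:i + 2]
        if key in counts:  # skip ambiguous bases or unexpected symbols
            counts[key] += 1
            total += 1
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: count / total for key, count in counts.items()}
```

For the toy hairpin `GGAAACC` with structure `((...))`, five triplets are observed (`G((.`, `A(..`, `A...`, `A..(`, `C.((`), each with a normalized value of 0.2.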
Fig. 4
Model performance of XGBoost and GBDT. a The model performance of XGBoost and GBDT for only the global kmer features, without any hyperparameter tuning. b Model performance comparison using area under the ROC curve (AUC). The XGBoost model has lower training AUC but higher testing AUC than the GBDT model. The XGBoost model trained with only global kmers performs the same as the GBDT model, but the number of features is reduced from 5780 to 340
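The AUC used for this comparison can be computed directly from predicted scores and true labels. The sketch below uses the rank interpretation of AUC: the probability that a randomly chosen positive (IRES) example receives a higher score than a randomly chosen negative one, with ties counted as half. It is a plain-Python illustration, not the evaluation code used in the paper, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels (1 = IRES in this context).
    scores: predicted probabilities or scores, in the same order as labels.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

Computing this separately on the training and testing sets gives the two numbers compared in panel b; a model with a high training AUC but a lower testing AUC is likely overfitting.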
Fig. 5
Effect of incorporating structural features. QMFE and triplet features were included in a combined model with global kmer features. We examined models incorporating only global kmer features, only structural features, and a combination of global kmer and structural features
Fig. 6
XGBoost model feature importance explained by SHAP values at the global scale. a The summary of SHAP values of the top 20 important features for the model including both global kmers and local kmers. b The summary of SHAP values of the top 20 important features for the model including only global kmers. c The summary of SHAP values of the top 20 important features for the model including both global kmers and structural features. d The summary of SHAP values of the top 20 important features for the model including only structural features
Fig. 7
XGBoost model feature importance explained by SHAP and LIME at a local scale. a SHAP (SHapley Additive exPlanations) dependence plots of the importance of the UUU and GA kmers in the XGBoost model. b Local Interpretable Model-agnostic Explanations (LIME) for the CrPV IGR IRES and CrPV protein coding sequence. Green bars show the weighted features that support classification as IRES and red bars show the weighted features that oppose classification as IRES
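The idea behind SHAP values can be illustrated with an exact Shapley computation on a toy model. The sketch below averages each feature's marginal contribution over every ordering in which features are switched from a baseline value to their observed value. This brute-force approach is only practical for a handful of features and is not the algorithm used by SHAP tooling for tree models. The toy model, its coefficients, and the use of "UUU" and "GA" as the two inputs are invented for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction, by enumerating all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            new_output = model(current)
            phi[i] += new_output - previous_output  # marginal contribution
            previous_output = new_output
    return [value / len(orderings) for value in phi]


def toy_model(features):
    """Hypothetical scoring function with an interaction between two kmer frequencies."""
    uuu, ga = features
    return 2.0 * uuu + 3.0 * ga + uuu * ga
```

With `x = [1.0, 1.0]` and a baseline of `[0.0, 0.0]`, the two attributions sum to the difference between the model output at `x` and at the baseline, which is the additivity property that gives SHAP its name.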
Fig. 8
Correlation between IRESpy prediction and experimental results
Fig. 9
The density distribution of predicted IRES probability in Dataset 2 and human UTR scan
Fig. 10
Predicted probability of IRES for highly structured RNA families, and IRES and non-IRES classes in Datasets 1 and 2

References

    1. Bonnet E, Wuyts J, Rouzé P, Van de Peer Y. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics. 2004;20(17):2911–2917. doi: 10.1093/bioinformatics/bth374.
    1. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
    1. Clote P, Ferre F, Kranakis E, Krizanc D. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA. 2005;11(5):578–591. doi: 10.1261/rna.7220505.
    1. Costantino D, Kieft JS. A preformed compact ribosome-binding domain in the cricket paralysis-like virus IRES RNAs. RNA. 2005;11(3):332–343. doi: 10.1261/rna.7184705.
    1. Fernandez-Miragall O, Martinez-Salas E. Structural organization of a viral IRES depends on the integrity of the GNRA motif. RNA. 2003;9(11):1333–1344. doi: 10.1261/rna.5950603.