RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

Zhibin Lv¹, Jun Zhang², Hui Ding³, Quan Zou¹,³

Affiliations

¹ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
² Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China.
³ Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.

PMID: 32175316
PMCID: PMC7054385
DOI: 10.3389/fbioe.2020.00134


Zhibin Lv et al. Front Bioeng Biotechnol. 2020 Feb 26:8:134. doi: 10.3389/fbioe.2020.00134. eCollection 2020.


Abstract

One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
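The incremental feature selection strategy described in the abstract can be illustrated with a minimal sketch. It assumes a feature ranking is already available (in the paper this comes from the light gradient boosting machine; here `ranking` is simply supplied by hand) and it uses a nearest-centroid classifier as a stand-in for the random forest so that the example needs only the standard library. All function names and the toy data are hypothetical and are not taken from the RF-PseU code.

```python
from typing import List, Sequence, Tuple


def nearest_centroid_predict(train_x: List[Sequence[float]],
                             train_y: List[int],
                             x: Sequence[float]) -> int:
    """Stand-in classifier (NOT a random forest): pick the class whose mean vector is closest."""
    best_label, best_dist = -1, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(xs: List[Sequence[float]], ys: List[int]) -> float:
    """Leave-one-out cross-validation: hold out each sample once, train on the rest."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)


def incremental_feature_selection(xs: List[Sequence[float]],
                                  ys: List[int],
                                  ranking: List[int]) -> Tuple[int, float]:
    """Add features one at a time in ranked order; keep the smallest prefix with the best accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        subset = [[row[c] for c in cols] for row in xs]
        acc = loo_accuracy(subset, ys)
        if acc > best_acc:  # strict '>' prefers the smaller feature set on ties
            best_k, best_acc = k, acc
    return best_k, best_acc


# Toy data (hypothetical): column 0 separates the classes, columns 1 and 2 are noise.
toy_x = [[0.0, 5.0, 1.0], [0.1, -5.0, 9.0], [0.2, 4.0, 2.0],
         [1.0, -4.0, 8.0], [1.1, 5.0, 1.5], [1.2, -5.0, 9.5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_ranking = [0, 2, 1]  # assumed importance order, most important first

best_k, best_acc = incremental_feature_selection(toy_x, toy_y, toy_ranking)
```

On this toy data the top-ranked feature alone already separates the two classes, so the search keeps a single feature. With real data the ranking would come from a trained model's feature importances and the classifier would be replaced by the model being optimized.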

Keywords: RNA; light gradient boosting; machine learning; pseudouridine sites; random forest.


Figures

**FIGURE 1**
A schematic diagram of RF-PseU. RNA sequences with or without pseudouridine sites were encoded via seven RNA coding technologies; following removal of redundant features by light gradient boosting machine feature selection, the random forest model was trained on smaller but more relevant feature vector spaces, and was evaluated through cross-validation and independent testing to obtain an optimized model for prediction.

**FIGURE 2**
**(A)** Accuracy of the random forest predictor as a function of feature dimension for the three species: **(A1)** *H. sapiens*; **(A2)** *S. cerevisiae*; **(A3)** *M. musculus*. The best independent-testing accuracies for *H. sapiens* and *S. cerevisiae* were 75.0% with 257 features and 77.0% with 397 features, respectively, and the best 10-fold cross-validated accuracy for *M. musculus* was 74.8% with 161 features. **(B)** Receiver operating characteristic (ROC) curves and area under the ROC curve (auROC) for each species under various conditions: **(B1)** *H. sapiens*; **(B2)** *S. cerevisiae*; **(B3)** *M. musculus*. A support vector machine (SVM) was used for comparison with the random forest (RF) model. "10-Fold model testing" and "leave-one-out (LOO) model testing" denote independent testing of the models with the best 10-fold and LOO cross-validation scores, respectively. During cross-validation (10-fold and LOO), the training datasets were divided into a training part and a validation part, following the usual machine learning evaluation scheme (training, validation, and testing) for model optimization. In the figure, the 10-fold and LOO cross-validation metrics were obtained from the validation part of the training datasets, whereas the independent testing metrics were obtained from the independent testing datasets.
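For readers unfamiliar with the auROC metric reported in panel (B), the following minimal sketch computes it from labels and predicted scores using the pairwise-ranking (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name `au_roc` and the example scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def au_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
example_auc = au_roc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The quadratic loop is fine for a small illustration. For large datasets a sort-based rank computation gives the same value more efficiently.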

**FIGURE 3**
A screenshot of the RF-PseU web server interface. The web server allows users to type or paste FASTA-format text into the text box and click the submit button; the results are displayed in the right-hand table.
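As an illustration of the FASTA-format input mentioned in the caption, the sketch below parses FASTA text into a mapping from header to sequence. It is a generic parser written for this example and makes no claim about how the RF-PseU server processes its input; the function name `parse_fasta` and the sample records are hypothetical.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[header] += line.upper()  # normalize case, append wrapped lines
    return records


# Hypothetical input with one wrapped, mixed-case record and one single-line record.
sample = ">seq1\nACGU\nuuga\n\n>seq2\nGGC\n"
parsed = parse_fasta(sample)
```

Here `parsed` maps `seq1` to `ACGUUUGA` and `seq2` to `GGC`.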


