Front Bioeng Biotechnol. 2020 Feb 26:8:134.
doi: 10.3389/fbioe.2020.00134. eCollection 2020.

RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

Zhibin Lv et al. Front Bioeng Biotechnol. 2020.

Abstract

Pseudouridine modification is one of the ubiquitous chemical modifications in RNA and is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to identify pseudouridine sites in RNA precisely. With the progress of next-generation sequencing technology, several useful machine learning approaches have recently become available; however, existing methods cannot predict sites with high accuracy, so a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for the prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and an incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model. Compared with previous state-of-the-art predictors on the same benchmark data sets of three species, RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% over the best existing predictor. Moreover, the final RF-PseU prediction model was built using leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
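The incremental feature selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a feature ranking is already available (in the paper this comes from the light gradient boosting machine; here `ranking` is a hypothetical hard-coded list) and that an `evaluate` callback returns a validation accuracy for a candidate subset. The function name `incremental_feature_selection` and the toy scorer `toy_accuracy` are illustrative assumptions, not the authors' code.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[int],
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Add ranked features one at a time and keep the best-scoring prefix."""
    best_subset: List[int] = []
    best_score = float("-inf")
    current: List[int] = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        if score > best_score:  # strict comparison: ties keep the smaller subset
            best_score = score
            best_subset = list(current)
    return best_subset, best_score


# Toy stand-in for "train a classifier and return its validation accuracy":
# features 3 and 0 help, every other feature costs a small penalty.
def toy_accuracy(subset: List[int]) -> float:
    useful = {3: 0.15, 0: 0.10}
    return 0.5 + sum(useful.get(f, -0.01) for f in subset)


ranking = [3, 0, 2, 1]  # hypothetical importance ranking, most important first
subset, score = incremental_feature_selection(ranking, toy_accuracy)
print(subset, round(score, 2))
```

In a real pipeline, `evaluate` would train the random forest on the selected columns and return a cross-validated accuracy; the toy scorer only makes the selection logic easy to follow.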

Keywords: RNA; light gradient boosting; machine learning; pseudouridine sites; random forest.

PubMed Disclaimer

Figures

FIGURE 1
A schematic diagram of RF-PseU. RNA sequences with or without pseudouridine sites were encoded via seven RNA coding technologies; following removal of redundant features by light gradient boosting machine feature selection, the random forest model was trained on smaller but more relevant feature vector spaces, and was evaluated through cross-validation and independent testing to obtain an optimized model for prediction.
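As an illustration of what encoding an RNA sequence into a numeric feature vector can look like, the following sketch computes normalized k-mer frequencies. This is a generic encoding chosen for the example; the caption does not name the seven encodings, so nothing here should be read as one of the authors' schemes. The function name `kmer_frequencies` is hypothetical.

```python
from itertools import product
from typing import Dict, List

ALPHABET = "ACGU"


def kmer_frequencies(sequence: str, k: int = 2) -> List[float]:
    """Return the normalized frequency of every k-mer over ACGU, in sorted order."""
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts: Dict[str, int] = {kmer: 0 for kmer in kmers}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


vector = kmer_frequencies("GAUCUAG", k=2)
print(len(vector))
```

Each sequence maps to a fixed-length vector (16 entries for k = 2), which is the kind of input a feature-selection step and a random forest can consume.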
FIGURE 2
(A) Accuracy of the random forest predictor as a function of feature dimension for all three species: (A1) H. sapiens; (A2) S. cerevisiae; (A3) M. musculus. The best independent testing accuracies for H. sapiens and S. cerevisiae were 75.0% with 257 features and 77.0% with 397 features, respectively, and the best 10-fold cross-validated accuracy for M. musculus was 74.8% with 161 features. (B) Receiver operating characteristic (ROC) curves and area under the ROC curve (auROC) for the different species under various conditions: (B1) H. sapiens; (B2) S. cerevisiae; (B3) M. musculus. A support vector machine (SVM) was used for comparison with the random forest (RF) model. "10-Fold model testing" and "LOO model testing" denote independent testing of the models with the best 10-fold and leave-one-out (LOO) cross-validation scores, respectively. For cross-validation (10-fold and LOO), the training datasets were divided into a training part and a validation part; that is, the general machine learning evaluation scheme (training, validation, and testing) was used for model optimization. In the figure, the 10-fold and LOO cross-validation metric values are obtained from the validation part of the training datasets, while the independent testing metric values are obtained from the independent testing datasets.
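The split into a training part and a validation part described in this caption can be sketched as index generation. The example below is a minimal illustration under stated assumptions: folds are contiguous and unshuffled for readability (real data would normally be shuffled first), and the function names `k_fold_splits` and `leave_one_out_splits` are hypothetical rather than taken from the paper.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_splits(n_samples: int, k: int) -> Iterator[Split]:
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def leave_one_out_splits(n_samples: int) -> Iterator[Split]:
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)


folds = list(k_fold_splits(10, 5))
loo = list(leave_one_out_splits(4))
print(folds[0])
print(loo[1])
```

A model is trained on each `train` list and scored on the matching `validation` list; the independent testing datasets stay outside these splits entirely.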
FIGURE 3
A screenshot of the RF-PseU web server interface. The web server allows users to type or paste FASTA-format text into the text box and click the submit button; the results are displayed in the right-hand table.
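FASTA-format text of the kind pasted into such a text box can be parsed with a few lines of code. The sketch below is a generic parser written for illustration; it is not the web server's implementation, and the function name `parse_fasta` and the example records are hypothetical.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each '>' header to its sequence, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate record name: {header!r}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    return records


example = """>seq1 hypothetical record
GAUCU
AGGCA
>seq2
ccuauagg
"""
parsed = parse_fasta(example)
print(parsed)
```

Each parsed sequence could then be passed to an encoding step before prediction.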
