BMC Syst Biol. 2017 Mar 14;11(Suppl 2):9.
doi: 10.1186/s12918-017-0390-8.

Selecting high-quality negative samples for effectively predicting protein-RNA interactions

Zhanzhan Cheng et al. BMC Syst Biol. 2017.

Abstract

Background: Identifying protein-RNA interactions (PRIs) is important for understanding cellular activities. Several machine learning-based methods have recently been developed for identifying PRIs, but their performance is unsatisfactory. One major reason is that they usually rely on unreliable negative samples during training.

Methods: To boost the performance of PRI prediction, we propose a novel method for generating reliable negative samples. First, we collect known PRIs as positive samples and use them to build positive sets. For each positive set, we construct two corresponding negative sets, one with our method and the other with a random method. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. This yields 18 datasets covering different species and different ratios of negative to positive samples. Second, sequence-based features are extracted to represent each PRI and each protein-RNA pair in the datasets, and a filter-based method is used to reduce the dimensionality of the feature vectors and thus the computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets.
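As an illustration of the random baseline only, the sketch below draws protein-RNA pairs uniformly at random from all pairs that are not known PRIs, at a chosen negative-to-positive ratio. This is one plausible reading of a "random method", not the authors' procedure; the function name `sample_random_negatives`, the `ratio` and `seed` parameters, and the toy identifiers are all assumptions made for the example.

```python
import random


def sample_random_negatives(positives, ratio, seed=0):
    """Pick random protein-RNA pairs that are not known positives.

    Hypothetical helper: `ratio` is the number of negatives per positive.
    """
    positive_set = set(positives)
    proteins = sorted({protein for protein, _ in positive_set})
    rnas = sorted({rna for _, rna in positive_set})

    # Every protein-RNA combination that is not a known interaction.
    candidates = [
        (protein, rna)
        for protein in proteins
        for rna in rnas
        if (protein, rna) not in positive_set
    ]

    wanted = int(ratio * len(positive_set))
    if wanted > len(candidates):
        raise ValueError("not enough non-interacting candidate pairs")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(candidates, wanted)


# Toy data: three known interactions (made-up identifiers).
positives = [("P1", "R1"), ("P2", "R2"), ("P3", "R3")]
negatives = sample_random_negatives(positives, ratio=2)

# Label positives 1 and negatives 0 to form one training dataset.
dataset = [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]
```

Pairs drawn this way are only presumed non-interacting, so some may be undiscovered true interactions, which is the unreliability the abstract attributes to random negatives.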

Results: Extensive experiments show that, compared with using randomly generated negative samples, all classifiers achieve substantial performance improvements when using negative samples selected by our method. The improvements in accuracy and geometric mean are as high as 204.5% and 68.7% for the SVM classifier, 174.5% and 53.9% for the RF classifier, and 80.9% and 54.3% for the NB classifier.
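For readers unfamiliar with the reported metrics, the sketch below computes sensitivity (SE), specificity (SP), geometric mean (GM) and accuracy (ACC) from binary labels. It assumes the standard definitions, with GM taken as the square root of SE times SP; the abstract does not spell out its formulas, and the function name `evaluate` and the toy labels are hypothetical.

```python
import math


def evaluate(y_true, y_pred):
    """Return SE, SP, GM and ACC for binary labels (1 = interacting)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    se = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    gm = math.sqrt(se * sp)                    # geometric mean of SE and SP
    acc = (tp + tn) / len(pairs)               # overall accuracy
    return {"SE": se, "SP": sp, "GM": gm, "ACC": acc}


# Toy labels: 4 positives and 4 negatives (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
```

GM is a useful companion to accuracy when the negative-to-positive ratio varies across datasets, because it drops toward zero if either class is predicted poorly.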

Conclusion: Our method is useful for identifying PRIs.

Keywords: Protein-RNA interactions; Reliable negative samples; Unreliable negative samples.

Figures

Fig. 1
The framework of this work. Here, rectangles are executive modules, and parallelograms are data modules
Fig. 2
The flowchart of constructing random negative samples
Fig. 3
The flowchart of constructing reliable negative samples
Fig. 4
Experimental results on SO datasets. (a)–(d) are the SE, SP, GM and ACC values of SVM classifiers; (e)–(h) are the SE, SP, GM and ACC values of RF classifiers; and (i)–(l) are the SE, SP, GM and ACC values of NB classifiers
Fig. 5
Experimental results on HOMO datasets. (a)–(d) are the SE, SP, GM and ACC values of SVM classifiers; (e)–(h) are the SE, SP, GM and ACC values of RF classifiers; and (i)–(l) are the SE, SP, GM and ACC values of NB classifiers
Fig. 6
Experimental results on MUS datasets. (a)–(d) are the SE, SP, GM and ACC values of SVM classifiers; (e)–(h) are the SE, SP, GM and ACC values of RF classifiers; and (i)–(l) are the SE, SP, GM and ACC values of NB classifiers
Fig. 7
AUC vs. score threshold (RF, SVM and NB)
Fig. 8
The PRI network constructed by true PRIs and predicted ones by our method. 256 predicted PRIs consist of 74 unique proteins and 56 unique RNAs. The yellow ellipses and purple diamonds represent proteins and RNAs, respectively. The solid and dotted lines are the true and predicted PRIs
