Sci Rep. 2015 Jan 26;5:8034.
doi: 10.1038/srep08034.

A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks

Suyu Mei et al. Sci Rep. 2015.

Abstract

Protein-protein interaction (PPI) prediction is generally treated as a binary classification problem, in which negative data sampling remains an open problem. The commonly used random sampling is prone to yield less representative negative data with considerable false negatives. Meanwhile, most existing computational methods seldom impose rational constraints on model selection to reduce the risk of false positive predictions. In this work, we propose a novel negative data sampling method based on one-class support vector machine (SVM) to predict proteome-wide protein interactions between HTLV retrovirus and Homo sapiens: one-class SVM is used to choose reliable and representative negative data, and two-class SVM is used to yield proteome-wide outcomes as predictive feedback for rational model selection. Computational results suggest that one-class SVM is better suited as a negative data sampling method than a two-class PPI predictor, and that model selection constrained by predictive feedback helps to yield a rational predictive model that reduces the risk of false positive predictions. Some predictions have been validated by the recent literature. Lastly, gene ontology based clustering of the predicted PPI networks is conducted to provide valuable cues for the pathogenesis of HTLV retrovirus.
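
The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's `OneClassSVM` and `SVC` as stand-ins for the one-class and two-class SVMs, uses synthetic 2-D feature vectors in place of real protein-pair features, and picks the values of `nu` and `gamma` arbitrarily. The helper names `sample_negatives`, `train_two_class` and `predicted_positive_rate` are hypothetical.

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC


def sample_negatives(X_pos, X_candidates, nu=0.1, gamma=0.5, seed=0):
    """Fit a one-class SVM on the positives and sample negatives
    from the candidates it rejects (predicted label -1)."""
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_pos)
    rejected = np.flatnonzero(ocsvm.predict(X_candidates) == -1)
    # Assumption: draw at most as many negatives as there are positives.
    n_neg = min(len(X_pos), len(rejected))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(rejected, size=n_neg, replace=False)
    return X_candidates[chosen]


def train_two_class(X_pos, X_neg, gamma=0.5, C=1.0):
    """Train a two-class SVM on positives (+1) and sampled negatives (-1)."""
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    return SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)


def predicted_positive_rate(model, X_all):
    """Fraction of all candidate pairs predicted as interacting."""
    return float(np.mean(model.predict(X_all) == 1))


# Synthetic toy data (hypothetical): positives near the origin, candidates
# drawn partly from the same region and partly from a distant cluster.
rng = np.random.default_rng(42)
X_pos = rng.normal(0.0, 1.0, size=(40, 2))
X_unl = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

X_neg = sample_negatives(X_pos, X_unl)
model = train_two_class(X_pos, X_neg)
rate = predicted_positive_rate(model, X_unl)
print(f"sampled negatives: {len(X_neg)}, predicted positive rate: {rate:.2f}")
```

In this sketch the predicted positive rate over all candidates plays the role of the "predictive feedback": a parameter pair whose two-class model labels an implausibly large share of candidates as positive would be a poor choice. How the paper turns that feedback into a concrete constraint is not specified in the abstract.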

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. One-class SVM OCSVM(ν, γ) parameter tuning.
Eight representative parameter pairs (ν, γ) are chosen from the parameter space according to OCSVM(ν, γ) LOOCV performance. The blue bars illustrate the recognition rate of the training positive data and the brown bars illustrate the predicted positive rate of proteome-wide predictions by OCSVM(ν, γ).
Figure 2. Two-class SVM TCSVM(ν, γ) LOOCV ROC curves.
For each representative parameter pair (ν, γ), one negative dataset is sampled from the predicted outcomes of the corresponding OCSVM(ν, γ). The sampled negative data are merged with the positive data to train the two-class SVM TCSVM(ν, γ). The ROC curves and corresponding AUC scores are used to estimate the quality of the negative data sampled by OCSVM(ν, γ).
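
As a reminder of what an AUC score measures, the sketch below computes it as the probability that a randomly chosen positive receives a higher classifier score than a randomly chosen negative, with ties counted as one half. The function name `auc_from_scores` and the score values are hypothetical and are not taken from the paper.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs contribute 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical decision scores for three positives and three negatives.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 3))  # 8 of 9 pairs are ranked correctly
```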
Figure 3. Two-class SVM TCSVM(ν, γ) LOOCV performance on datasets S, S1, S2 and S3.
For each representative parameter pair (ν, γ), negative datasets are sampled from the predicted outcomes of the corresponding OCSVM(ν, γ) to serve as the negative data of datasets S, S1, S2 and S3, which are then used to train four two-class SVMs TCSVM(ν, γ). The blue bars denote TCSVM(ν, γ) LOOCV accuracy and the brown bars denote TCSVM(ν, γ) LOOCV MCC.
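
The two metrics named in this caption follow their standard definitions: accuracy is the fraction of correct predictions, and the Matthews correlation coefficient (MCC) is (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch, assuming labels coded as 1 (interacting) and 0 (non-interacting) and using made-up predictions:

```python
import math


def accuracy_and_mcc(y_true, y_pred):
    """Return (accuracy, MCC) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return acc, mcc


# Hypothetical labels: TP = 3, FN = 1, TN = 3, FP = 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc, mcc = accuracy_and_mcc(y_true, y_pred)
print(acc, mcc)  # 0.75 0.5
```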
Figure 4. Two-class SVM TCSVM(ν, γ) proteome-wide predicted positive rates.
From the predicted positive rates, K values are derived and used as a constraint on OCSVM(ν, γ) model selection. A lower bar signifies a higher K value.
Figure 5. Percentage of HTLV targeted human proteins predicted by two-class SVM TCSVM(ν, γ).
Lower bars are taken to signify a lower risk of false positive predictions. This metric, together with the K value, is used as a constraint on model selection of the one-class SVM OCSVM(ν, γ).
Figure 6. Details of percentage of HTLV targeted human proteins predicted by TCSVM(ν, γ).
From this metric, a K value can be derived for each HTLV protein to conduct fine-grained model selection of the one-class SVM OCSVM(ν, γ). Parameter pairs (ν, γ) with more low bars are preferred.
Figure 7. Comparative ROC curves between the final model TCSVM(ν = 1, γ = 3) and random sampling model TCSVMrandom.
In terms of AUC score, TCSVM(ν = 1, γ = 3) outperforms TCSVMrandom.
Figure 8. Gene ontology based clustering of predicted PPI subnetworks - biological processes.
Three human signaling pathways predicted to be targeted by HTLV proteins are illustrated as examples: GO:0007219 - Notch signaling pathway; GO:0050852 - T cell receptor signaling pathway; GO:0046426 - negative regulation of JAK-STAT cascade. Diamonds denote HTLV proteins and ellipses denote human proteins.
Figure 9. Gene ontology based clustering of predicted PPI subnetworks - molecular functions.
Three human molecular function modules predicted to be targeted by HTLV proteins are illustrated as examples: GO:0017124 - SH3 domain binding; GO:0002039 - p53 binding; GO:0004553 - hydrolase activity. Diamonds denote HTLV proteins and ellipses denote human proteins.
