A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites
- PMID: 26441427
- DOI: 10.1109/TNB.2015.2475359
Abstract
Protein-protein interactions are ubiquitous and play important roles in the life cycles of living cells. The interaction sites (residues) are essential to understanding the underlying mechanisms of protein-protein interactions. Previous research has demonstrated that accurate identification of protein-protein interaction sites (PPIs) is helpful for developing new therapeutic drugs, because many drugs interact directly with those residues. Because of its significant potential in biological research and drug development, the prediction of PPIs has become an important topic in computational biology. However, the PPIs prediction problem suffers from severe data imbalance: the majority class samples (non-interacting residues) far outnumber the minority class samples (interacting residues). We therefore developed a novel cascade random forests algorithm (CRF) to address this imbalance. The proposed CRF counters the negative effect of data imbalance by connecting multiple random forests in a cascade-like manner; each forest is trained on a balanced training subset that includes all minority samples and a subset of majority samples, and the forests are combined using an effective ensemble protocol. Based on the proposed CRF, we implemented a new sequence-based PPIs predictor, called CRF-PPI, which takes the combined features of position-specific scoring matrices, averaged cumulative hydropathy, and predicted relative solvent accessibility as model inputs. Benchmark experiments on both cross-validation and independent validation datasets demonstrated that the proposed CRF-PPI outperformed state-of-the-art sequence-based PPIs predictors. The source code for CRF-PPI and the benchmark datasets are available online at http://csbio.njust.edu.cn/bioinf/CRF-PPI for free academic use.
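The cascade idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how stages are linked or combined, so the sketch assumes that each stage drops majority samples it already scores as clearly non-interacting and that stage scores are averaged at prediction time. The function names (`train_stump_forest`, `forest_score`, `train_cascade`, `predict_cascade`), the parameters `n_stages` and `keep_threshold`, and the tiny stump-based forest standing in for a real random forest are all hypothetical choices made to keep the example dependency-free.

```python
import random


def train_stump_forest(X, y, n_trees=25, rng=None):
    """Stand-in for a random forest (assumption): bagged depth-1 trees,
    each built on a bootstrap sample and one randomly chosen feature."""
    rng = rng or random.Random(0)
    n, d = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        f = rng.randrange(d)                         # random feature
        pos = [X[i][f] for i in idx if y[i] == 1]
        neg = [X[i][f] for i in idx if y[i] == 0]
        if not pos or not neg:
            continue                                 # skip degenerate draws
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        threshold = (mean_pos + mean_neg) / 2
        sign = 1 if mean_pos >= mean_neg else -1
        stumps.append((f, threshold, sign))
    return stumps


def forest_score(stumps, x):
    """Fraction of stumps voting for the minority (interacting) class."""
    if not stumps:
        return 0.0
    votes = sum(1 for f, t, s in stumps if (x[f] - t) * s > 0)
    return votes / len(stumps)


def train_cascade(X, y, n_stages=3, keep_threshold=0.2, seed=0):
    """Train forests in sequence, each on all minority samples plus an
    equally sized random subset of the remaining majority samples."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    majority = [x for x, label in zip(X, y) if label == 0]
    stages = []
    for _ in range(n_stages):
        if len(majority) < len(minority):
            break                                    # no balanced subset left
        sampled = rng.sample(majority, len(minority))
        X_bal = minority + sampled
        y_bal = [1] * len(minority) + [0] * len(sampled)
        forest = train_stump_forest(X_bal, y_bal, rng=rng)
        stages.append(forest)
        # Assumed cascade rule: keep only the majority samples this stage
        # does not already score as clearly non-interacting.
        majority = [x for x in majority
                    if forest_score(forest, x) >= keep_threshold]
    return stages


def predict_cascade(stages, x, threshold=0.5):
    """Assumed ensemble rule: average the stage scores, then threshold."""
    score = sum(forest_score(f, x) for f in stages) / len(stages)
    return int(score >= threshold)
```

In this sketch each stage sees a 1:1 class ratio no matter how skewed the full dataset is, and later stages concentrate on the majority samples that earlier stages found hard to reject. In practice the stump forest would be replaced by a full random forest implementation, and the inputs would be per-residue feature vectors such as the combined features named in the abstract.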