A two-step ensemble learning for predicting protein hot spot residues from whole protein sequence
- PMID: 35098379
- DOI: 10.1007/s00726-022-03129-5
Abstract
Protein hot spot residues are functional sites in protein-protein interactions. They are traditionally identified by biological experiments, which are laborious and time-consuming, so a variety of computational methods have come into wide use in recent years. Despite their success in hot spot identification, most of these methods are impractical because they can recognize hot spot residues only among known protein-protein interface residues. Identifying hot spots from the whole protein sequence is therefore a meaningful problem. However, it introduces an extreme imbalance between positive and negative samples: hot spot residues account for only about 1-2% of whole protein sequences. To address this issue, this paper proposes a two-step ensemble model for identifying hot spot residues from extremely unbalanced data sets. The model comprises 134 classifiers built from KNN and SVM base learners. Compared with previous methods, our model performs well, with an F1 score of 0.593 on the BID test set. To validate its robustness, the model was also tested on three other independent test sets, where it likewise achieved good predictions. More importantly, the performance of our model on unbalanced data is comparable with that of other methods tested on balanced hot spot data sets.
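To make the choice of metric concrete, the following minimal sketch shows why accuracy says little at a positive rate of about 1-2% and why the F1 score is reported instead. The residue counts and both predictors are hypothetical values chosen for illustration; they are not taken from the paper or from the BID test set.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall (0.0 if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data: 1000 residues, 15 of them hot spots (1.5% positives).
y_true = [1] * 15 + [0] * 985

# Predictor A (hypothetical): labels every residue as non-hot-spot.
pred_trivial = [0] * 1000

# Predictor B (hypothetical): finds 9 of the 15 hot spots, with 6 false alarms.
pred_model = [1] * 9 + [0] * 6 + [1] * 6 + [0] * 979

tp_a, fp_a, fn_a, tn_a = confusion_counts(y_true, pred_trivial)
tp_b, fp_b, fn_b, tn_b = confusion_counts(y_true, pred_model)

acc_trivial = accuracy(tp_a, fp_a, fn_a, tn_a)
f1_trivial = f1_score(tp_a, fp_a, fn_a)
acc_model = accuracy(tp_b, fp_b, fn_b, tn_b)
f1_model = f1_score(tp_b, fp_b, fn_b)

print(f"trivial: accuracy={acc_trivial:.3f}, F1={f1_trivial:.3f}")
print(f"model:   accuracy={acc_model:.3f}, F1={f1_model:.3f}")
```

In this toy setting the two predictors differ by less than one percentage point in accuracy, while their F1 scores differ sharply, because F1 ignores the abundant true negatives.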
Keywords: Ensemble learning; F1 score; Protein hot spot residues; Unbalanced data set.
© 2022. The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature.