Amino Acids. 2022 May;54(5):765-776.
doi: 10.1007/s00726-022-03129-5. Epub 2022 Jan 30.

A two-step ensemble learning for predicting protein hot spot residues from whole protein sequence

SiJie Yao et al. Amino Acids. 2022 May.

Abstract

Protein hot spot residues are functional sites in protein-protein interactions. Hot spot residues have traditionally been identified by biological experiments, which are laborious and time-consuming, so a variety of computational methods have come into wide use in recent years. Despite their success in hot spot identification, most of these methods are impractical because they can recognize hot spot residues only among protein-protein interface residues that are already known. Identifying hot spots from the whole protein sequence is therefore a meaningful and interesting problem. It introduces, however, an extreme imbalance between positive and negative samples: hot spot residues account for only about 1-2% of whole protein sequences. To address this issue, this paper proposes a two-step ensemble model for identifying hot spot residues from an extremely unbalanced data set. The model is composed of 134 classifiers built from KNN and SVM base classifiers. Compared with previous methods, our model performs well, with an F1 score of 0.593 on the BID test set. Furthermore, to validate its robustness, the model was tested on three other independent test sets and also achieved good predictions. More importantly, the performance of our model on an unbalanced data set is comparable with that of other methods tested on balanced hot spot data sets.

Keywords: Ensemble learning; F1 score; Protein hot spot residues; Unbalanced data set.
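
The abstract reports an F1 score rather than accuracy. A minimal sketch of why that matters when hot spots make up only about 1-2% of residues is given below. The labels and both predictors are illustrative assumptions, not data or models from the paper, and the helper names (`confusion_counts`, `accuracy`, `f1_score`) are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 marks a hot spot."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of residues labelled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative sequence: 1000 residues, 15 of them hot spots (1.5%).
y_true = [1] * 15 + [0] * 985

# Assumed predictor A: calls every residue a non-hot-spot.
all_negative = [0] * 1000

# Assumed predictor B: finds 9 of the 15 hot spots and raises 6 false alarms.
some_hits = [1] * 9 + [0] * 6 + [1] * 6 + [0] * 979

for name, pred in [("all negative", all_negative), ("some hits", some_hits)]:
    print(name, round(accuracy(y_true, pred), 3), round(f1_score(y_true, pred), 3))
```

In this toy setting both predictors reach an accuracy above 0.98, yet only the second has a non-zero F1, which is why F1 is the more informative figure for whole-sequence hot spot prediction.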
