Commun Biol. 2023 Jan 19;6(1):73.

doi: 10.1038/s42003-023-04462-5.

Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning

Zilong Hou^#¹, Yuning Yang^#², Zhiqiang Ma², Ka-Chun Wong³, Xiangtao Li⁴

Affiliations

¹ School of Artificial Intelligence, Jilin University, Jilin, China.
² Information Science and Technology, Northeast Normal University, Jilin, China.
³ Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China.
⁴ School of Artificial Intelligence, Jilin University, Jilin, China. lixt314@jlu.edu.cn.

^# Contributed equally.

PMID: 36653447
PMCID: PMC9849350
DOI: 10.1038/s42003-023-04462-5

Zilong Hou et al. Commun Biol. 2023.

Abstract

Protein-protein interactions (PPIs) govern cellular pathways and processes by significantly influencing the functional expression of proteins. Accurate identification of PPI binding sites has therefore become a key step in the functional analysis of proteins. However, because most computational methods are built on biological features, no protein language models are available that directly encode amino acid sequences into distributed vector representations capturing their characteristics for protein-protein binding events. Moreover, the number of experimentally detected protein interaction sites is much smaller than the number of protein-protein interactions or protein sites in protein complexes, which yields unbalanced data sets and leaves room for improvement in predictive performance. To address these problems, we develop EDLMPPI, a PPI site identification method based on an ensemble deep learning model (EDLM). Evaluation results on three widely used benchmark datasets (Dset_448, Dset_72, and Dset_164) show that EDLMPPI outperforms state-of-the-art techniques, including several PPI site prediction models, exceeding those models by nearly 10% in terms of average precision. In addition, the biological and interpretability analyses provide new insights into protein binding site identification and characterization mechanisms from different perspectives. The EDLMPPI webserver is available at http://www.edlmppi.top:5002/ .

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Overview of the proposed method, an ensemble deep learning model (EDLMPPI)-based protein–protein interaction site identifier consisting of two main components: Bi-directional Long Short-Term Memory (BiLSTM) for extracting long-range dependencies of features and capsule network for exploring the intrinsic association between features and preserving inter-sample location information.**
On the one hand, this design can capture the correlation between features in both directions and fully considers the contextual information. On the other hand, the capsule can retain key information as much as possible while reducing the dimensionality of features, avoiding information leakage, and improving the efficiency of the algorithm.
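
The caption does not spell out the capsule computation. As a rough illustration only, the sketch below shows the "squash" nonlinearity commonly used in capsule networks, which shrinks a capsule's output vector to a length below 1 while keeping its direction. Whether EDLMPPI uses exactly this form is an assumption, and the function name `squash` and the `eps` parameter are hypothetical.

```python
import math


def squash(vector, eps=1e-9):
    """Standard capsule 'squash': keep direction, map length into [0, 1).

    Assumption: this is the generic capsule-network formulation, not
    necessarily the exact variant used in EDLMPPI.
    """
    sq_norm = sum(x * x for x in vector)   # squared L2 norm of the input
    norm = math.sqrt(sq_norm)
    # Output length is sq_norm / (1 + sq_norm); eps avoids division by zero.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in vector]


if __name__ == "__main__":
    out = squash([3.0, 4.0])
    print(out, math.sqrt(sum(x * x for x in out)))
```

Long input vectors end up with a length close to 1 and short ones close to 0, so the output length can be read as a confidence-like score while the direction carries the feature information.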

**Fig. 2. Experimental results are presented to reveal the effectiveness of the model.**
a Radar chart of the evaluation indicators for the different window sizes. b Performance comparison of ProtT5, MBF, and combined features on the classifier, where “Average evaluation metric values” refers to the average of the eight evaluation metrics (TPR, TNR, Pre, ACC, F1, MCC, AUROC, and AP) for the different feature descriptors on the three datasets. c Performance comparison between the EDLMPPI architecture and 10 mainstream machine learning and deep learning models; EDLMPPI is particularly strong in key metrics. d Performance comparison of different methods for handling dataset imbalance, where “Average evaluation metric values” refers to the average of the eight evaluation metrics (TPR, TNR, Pre, ACC, F1, MCC, AUROC, and AP) for the different algorithms on the three datasets.
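
The caption names the evaluation metrics but gives no formulas. The following is a minimal Python sketch of their standard textbook definitions for binary binding-site labels (1 = binding, 0 = non-binding). The function names are hypothetical, the choice to return 0.0 for an undefined ratio is an assumption, and AUROC is omitted for brevity. It is not the authors' evaluation code.

```python
import math


def threshold_metrics(y_true, y_pred):
    """TPR, TNR, Pre, ACC, F1 and MCC from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumption: report 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)                      # sensitivity / recall
    tnr = ratio(tn, tn + fp)                      # specificity
    pre = ratio(tp, tp + fp)                      # precision
    acc = ratio(tp + tn, len(pairs))
    f1 = ratio(2 * pre * tpr, pre + tpr)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"TPR": tpr, "TNR": tnr, "Pre": pre,
            "ACC": acc, "F1": f1, "MCC": mcc}


def average_precision(y_true, scores):
    """AP as the mean precision at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0, 0, 0, 0]
    predictions = [1, 0, 0, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    print(threshold_metrics(labels, predictions))
    print(average_precision(labels, scores))
```

Average precision and MCC are usually the more informative of these on imbalanced data such as binding-site labels, because accuracy can stay high even when few binding residues are recovered.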

**Fig. 3. Display of the results of comparative experiments and biological analysis experiments.**
a Comparison between EDLMPPI and ten other competitive methods, where “Average evaluation metric values” refers to the average of the eight evaluation metrics (TPR, TNR, Pre, ACC, F1, MCC, AUROC, and AP) for the different methods on the three datasets. b Comparison of the PPIs predicted by EDLMPPI, DELPHI, and SCRIBER with native PPIs. Based on the proportion of PPIs in each domain, the EDLMPPI predictions have the highest correlation with the native PPIs.

**Fig. 4. Presentation of the results of the interpretability analysis experiment.**
a The t-SNE flow graph shows the clustering effect of the outputs of the different intermediate layers of the EDLMPPI architecture. b The 20 features with the greatest impact on PPI identification, showing how they contribute to predicting non-binding sites and binding sites, respectively. c Schematic diagrams showing the interaction between feature 1024 and other features, and between feature 569 and other features, respectively. d A stacked diagram showing the effect of each feature on each sample.

**Fig. 5. Presentation of the results of the interpretability analysis experiment.**
a Correlation heat map of each residue under ProtT5 embedding. b Attention view with different layers and different attention heads. c Attention flow view between different layers, with each color representing a different layer.

