An Efficient Computational Model for Large-Scale Prediction of Protein-Protein Interactions Based on Accurate and Scalable Graph Embedding

Xiao-Rui Su¹ ² ³, Zhu-Hong You¹ ² ³, Lun Hu¹ ² ³, Yu-An Huang¹, Yi Wang¹ ² ³, Hai-Cheng Yi¹ ² ³

Affiliations

¹ Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China.
² University of Chinese Academy of Sciences, Beijing, China.
³ Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China.

PMID: 33719344
PMCID: PMC7953052
DOI: 10.3389/fgene.2021.635451

Xiao-Rui Su et al. Front Genet. 2021 Feb 26:12:635451. doi: 10.3389/fgene.2021.635451. eCollection 2021.

Abstract

Protein-protein interactions (PPIs) underlie the molecular mechanisms of living cells. Although traditional experiments can detect PPIs accurately, they are costly and time-consuming, so computational methods have been used to predict PPIs instead. Graphs, as important and pervasive data carriers, are considered the most suitable structure for representing biomedical entities and their relationships. Although graph embedding is the most popular approach to graph representation learning, it usually incurs high computational and space costs, especially on large-scale graphs. A framework that both accelerates graph embedding and improves the accuracy of the resulting embeddings is therefore important for large-scale PPI prediction. In this paper, we propose LPPI, a multi-level model that improves both the quality and the speed of large-scale PPI prediction. First, basic information about each protein is collected as its attributes, including positional gene sets, motif gene sets, and immunological signatures. Second, we construct a weighted graph by using the protein attributes to calculate node similarity. Third, GraphZoom is used to accelerate the embedding process by reducing the size of the weighted graph. Fourth, graph embedding methods are used to learn graph topology features from the reconstructed graph. Finally, a linear logistic regression (LR) model is used to predict the probability that two proteins interact. LPPI achieved high accuracies of 0.99997 and 0.9979 on the PPI network dataset and the GraphSAGE-PPI dataset, respectively. Further results show that LPPI is promising for large-scale PPI prediction in both accuracy and efficiency, which is also beneficial for detecting interactions among other large-scale biomedical molecules.

Keywords: GraphZoom; graph embedding; large-scale; protein-protein interaction; weighted graph.
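
The second and final steps of the abstract (similarity-weighted graph, then logistic regression on pairs of proteins) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not name the similarity measure or the pair feature, so Jaccard similarity over attribute sets and the element-wise (Hadamard) product of two embedding vectors are assumptions made here for illustration. The protein names, attribute labels, and embedding values are hypothetical toy data, and the embeddings stand in for the output of a real graph embedding method.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two attribute sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_edges(attributes, edges):
    """Weight each edge by the attribute similarity of its two endpoints."""
    return {(u, v): jaccard(attributes[u], attributes[v]) for u, v in edges}

def hadamard(x, y):
    """Element-wise product of two embeddings (assumed pair feature)."""
    return [a * b for a, b in zip(x, y)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that the pair described by x interacts."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical attributes: gene-set labels attached to each protein.
attributes = {
    "P1": {"chr1p36", "motif_A", "immune_X"},
    "P2": {"chr1p36", "motif_A"},
    "P3": {"chr7q11", "motif_B"},
}
weights = weighted_edges(attributes, [("P1", "P2"), ("P2", "P3")])

# Hypothetical 2-d embeddings, standing in for a graph embedding method's output.
emb = {
    "P1": [1.0, 0.9], "P2": [0.9, 1.0], "P3": [-1.0, 0.8],
    "P4": [0.8, -1.0], "P5": [1.0, 1.0],
}
pairs = [("P1", "P2"), ("P1", "P5"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4")]
labels = [1, 1, 0, 0, 0]  # 1 = interacting pair, 0 = non-interacting pair
features = [hadamard(emb[u], emb[v]) for u, v in pairs]

w, b = train_logistic_regression(features, labels)
probs = [predict_proba(w, b, x) for x in features]
```

In this toy setup, `weights` holds one similarity score per edge and `probs` holds one interaction probability per candidate pair. A real run would replace the hand-written embeddings with vectors learned from the coarsened graph.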

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
The overview of the proposed model.

**Figure 2**
Timing experiments of four parts on PPI network dataset and GraphSAGE-PPI dataset.

**Figure 3**
Timing experiments of different embedding methods on PPI network dataset and GraphSAGE-PPI dataset.

**Figure 4**
Accuracy and timing experiments on two benchmark datasets. **(A)** Model performance with respect to the coarsening level on the PPI network dataset. **(B)** Model performance with respect to the coarsening level on the GraphSAGE-PPI dataset. **(C)** Model performance with respect to the fusion parameter on the PPI network dataset. **(D)** Model performance with respect to the fusion parameter on the GraphSAGE-PPI dataset.

**Figure 5**
**(A)** Change in the number of links and nodes as the coarsening level increases on the PPI network dataset. **(B)** Change in the number of links and nodes as the coarsening level increases on the GraphSAGE-PPI dataset.
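
The shrinking node and link counts described for Figure 5 can be illustrated with a toy coarsening step. GraphZoom is a multi-level spectral approach. The sketch below does not reproduce it and instead uses a much simpler greedy heavy-edge matching as a stand-in, merging the endpoints of the heaviest available edges. The function names and the four-node graph are hypothetical.

```python
def coarsen_once(weights):
    """One coarsening level: greedily merge endpoints of the heaviest edges.

    `weights` maps (u, v) edges to similarity weights. Each node is merged
    at most once per level; unmatched nodes are carried over unchanged.
    """
    parent = {}
    for (u, v), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if u not in parent and v not in parent:
            parent[u] = parent[v] = u  # v is merged into u

    coarse = {}
    for (u, v), w in weights.items():
        cu, cv = parent.get(u, u), parent.get(v, v)
        if cu != cv:  # edges inside a merged pair disappear
            key = (min(cu, cv), max(cu, cv))
            coarse[key] = coarse.get(key, 0.0) + w  # parallel edges add up
    return coarse

def graph_size(weights):
    """Return (number of nodes that have at least one edge, number of links)."""
    nodes = {n for edge in weights for n in edge}
    return len(nodes), len(weights)

# Hypothetical weighted graph on four nodes.
fine = {
    ("a", "b"): 0.9, ("b", "c"): 0.2, ("c", "d"): 0.8,
    ("a", "c"): 0.1, ("b", "d"): 0.3,
}
coarse = coarsen_once(fine)
fine_size = graph_size(fine)
coarse_size = graph_size(coarse)
```

Here one level merges `a` with `b` and `c` with `d`, so the graph drops from 4 nodes and 5 links to 2 nodes and 1 link. Applying `coarsen_once` repeatedly gives further levels, which is the kind of trend the figure reports against the coarsening level.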

Cited by

Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.
Asim MN, Asif T, Hassan F, Dengel A. Database (Oxford). 2025 May 30;2025:baaf027. doi: 10.1093/database/baaf027. PMID: 40448683 Free PMC article.
A multi-source molecular network representation model for protein-protein interactions prediction.
Zou HT, Ji BY, Xie XL. Sci Rep. 2024 Mar 14;14(1):6184. doi: 10.1038/s41598-024-56286-w. PMID: 38485942 Free PMC article.
An Ensemble Classifiers for Improved Prediction of Native-Non-Native Protein-Protein Interaction.
Pratiwi NKC, Tayara H, Chong KT. Int J Mol Sci. 2024 May 29;25(11):5957. doi: 10.3390/ijms25115957. PMID: 38892144 Free PMC article.
Multi-view heterogeneous molecular network representation learning for protein-protein interaction prediction.
Su XR, Hu L, You ZH, Hu PW, Zhao BW. BMC Bioinformatics. 2022 Jun 16;23(1):234. doi: 10.1186/s12859-022-04766-z. PMID: 35710342 Free PMC article.
Graph embedding on mass spectrometry- and sequencing-based biomedical data.
Alvarez-Mamani E, Dechant R, Beltran-Castañón CA, Ibáñez AJ. BMC Bioinformatics. 2024 Jan 2;25(1):1. doi: 10.1186/s12859-023-05612-6. PMID: 38166530 Free PMC article. Review.
