HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences

Kai Zhao¹, Zhuocheng Ji¹, Linlin Zhang², Na Quan¹, Yuheng Li¹, Guanglei Yu³,⁴, Xuehua Bi⁵,⁶

Affiliations

¹ School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China.
² School of Software, Xinjiang University, Urumqi, 830011, China.
³ College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China.
⁴ School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
⁵ College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China. bxh0327@foxmail.com.
⁶ School of Computer Science and Engineering, Central South University, Changsha, 410083, China. bxh0327@foxmail.com.

PMID: 40263997
PMCID: PMC12013097
DOI: 10.1186/s12859-025-06122-3

Kai Zhao et al. BMC Bioinformatics. 2025 Apr 22;26(1):110. doi: 10.1186/s12859-025-06122-3.

Abstract

Background: Understanding the relationships between proteins and specific disease phenotypes contributes to the early detection of diseases and advances the development of personalized medicine. The availability of large amounts of proteomics data has facilitated this process. Computational methods, which aim to improve discovery efficiency and reduce the time and financial costs of biological experiments, have yielded promising results. However, the lack of rich and reliable protein-related information still presents challenges.

Results: In this paper, we propose an ensemble prediction model, named HPOseq, which predicts human protein-phenotype relationships based only on sequence information. HPOseq comprises two base models. The first extracts intra-sequence information directly from amino acid sequences and uses it as protein features to predict the associated phenotypes. The second builds a protein-protein network based on sequence similarity and extracts inter-protein information for phenotype prediction. Finally, an ensemble module integrates the predictions of both base models to produce the final prediction.
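
The integration step can be illustrated with a minimal late-fusion sketch. This is an assumption-laden stand-in, not the authors' ensemble module: according to the Fig. 1 caption, HPOseq fuses the two predictions with a fully connected neural network and a mask matrix, whereas the sketch below simply takes a fixed weighted average. The function name `fuse_scores`, the `weight` parameter, and all numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows: proteins, columns: phenotype terms


def fuse_scores(intra: Matrix, inter: Matrix, weight: float = 0.5) -> Matrix:
    """Blend two score matrices of identical shape, element by element.

    `weight` is a hypothetical fixed mixing coefficient for the
    intra-sequence model; it stands in for a learned fusion.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(intra) != len(inter):
        raise ValueError("score matrices must cover the same proteins")
    fused: Matrix = []
    for row_a, row_b in zip(intra, inter):
        if len(row_a) != len(row_b):
            raise ValueError("score matrices must cover the same phenotype terms")
        fused.append([weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)])
    return fused


# Toy scores for 2 proteins x 3 phenotype terms (made-up numbers).
intra_scores = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.6]]  # from the intra-sequence model
inter_scores = [[0.7, 0.3, 0.0], [0.4, 0.6, 0.2]]  # from the similarity-network model
fused_scores = fuse_scores(intra_scores, inter_scores, weight=0.5)
```

A fixed average treats every phenotype term identically; a learned fusion layer can instead weight the two base models differently per term, which is presumably the motivation for the neural ensemble module.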

Conclusion: The results of 5-fold cross-validation show that HPOseq outperforms seven baseline methods in predicting protein-phenotype relationships. Moreover, we conduct case studies from the perspectives of phenotype annotation and protein analysis to verify the practical significance of HPOseq.

Keywords: Amino acid sequence; Deep learning; Ensemble model; Variational graph autoencoder.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Conflict of interest: The authors declare that they have no conflict of interest.

Figures

**Fig. 1**
Framework of HPOseq. (A) Prediction based on intra-sequence features: starting from the encoded amino acid sequence, intra-sequence features are extracted by a multi-scale convolutional layer, reduced in dimensionality by pooling and flattening layers, and passed to a fully connected layer that predicts intra-sequence disease phenotype association scores. (B) Prediction based on inter-sequence features: inter-sequence similarity among proteins is computed with the BLAST tool and used to construct an attributed graph; a variational graph autoencoder then extracts node feature representations, which are fed into a fully connected neural network to predict inter-sequence disease phenotype association scores. (C) Combination module: a fully connected neural network and a mask matrix fuse the predictions of the intra- and inter-sequence sub-models to generate more accurate disease phenotype association scores.
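
To make panel (A) concrete, the following is a minimal pure-Python sketch of multi-scale 1D convolution followed by global max pooling over a one-hot encoded sequence. It is an illustration under stated assumptions, not the HPOseq architecture: the kernels here are hand-made motif detectors (`motif_kernel` is a hypothetical helper), whereas a real convolutional layer learns its filters, usually applies a nonlinearity, and may pad the input. The encoding scheme, kernel widths, and example sequence are also assumptions.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str) -> List[List[float]]:
    """Encode each residue as a 20-dimensional indicator vector.

    Residues outside the standard alphabet (e.g. 'X') become all-zero rows.
    """
    rows = []
    for aa in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def motif_kernel(motif: str) -> List[List[float]]:
    """Hypothetical hand-made kernel (width x 20) that responds to one motif."""
    return one_hot(motif)


def conv1d_max(encoded: List[List[float]], kernel: List[List[float]]) -> float:
    """Slide one kernel along the sequence (no padding) and keep the
    largest response, i.e. convolution followed by global max pooling."""
    width = len(kernel)
    if len(encoded) < width:
        return 0.0  # sequence shorter than the kernel: no valid window
    best = float("-inf")
    for start in range(len(encoded) - width + 1):
        response = sum(
            kernel[k][c] * encoded[start + k][c]
            for k in range(width)
            for c in range(len(AMINO_ACIDS))
        )
        best = max(best, response)
    return best


def multi_scale_features(sequence: str, motifs: List[str]) -> List[float]:
    """One pooled feature per kernel; kernels of different widths give
    the 'multi-scale' view of the same sequence."""
    encoded = one_hot(sequence)
    return [conv1d_max(encoded, motif_kernel(m)) for m in motifs]


# Kernels of width 2, 3 and 4 applied to a made-up 10-residue sequence.
features = multi_scale_features("MKTAYIAKQR", ["KT", "AYI", "QRWW"])
```

Each pooled value counts how many positions of the best-matching window agree with the kernel, so the resulting vector has a fixed length regardless of sequence length, which is what allows it to be passed to a fully connected layer.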

**Fig. 2**
Prediction performance of HPOseq compared with the baseline methods. HPOseq significantly outperforms the baseline methods on two key metrics, AUPR and $F_{\max}$, reaching 0.3244 and 0.3869, respectively. HPOseq extracts features from two perspectives, the amino acid composition of each protein sequence and the similarity between sequences, to generate more comprehensive feature representations, which improves the prediction performance and robustness of the model.
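
For readers unfamiliar with $F_{\max}$, the sketch below computes the protein-centric maximum F-measure as it is commonly defined in function-prediction benchmarks such as CAFA: at each threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and the best harmonic mean over thresholds is reported. Whether the paper uses exactly this protocol (threshold grid, handling of ontology ancestors) is not stated in the abstract, so treat this as an assumption. The term identifiers and scores are placeholders.

```python
from typing import Dict, List, Set, Tuple


def f_max(
    scores: List[Dict[str, float]],
    truths: List[Set[str]],
    steps: int = 100,
) -> Tuple[float, float]:
    """Return (best F-measure, threshold at which it is first reached).

    scores[i] maps phenotype terms to predicted scores for protein i;
    truths[i] is the set of true terms for the same protein.
    """
    best_f, best_t = 0.0, 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []      # only proteins with at least one prediction >= t
        recall_sum = 0.0
        n_annotated = 0      # proteins with at least one true term
        for pred, true in zip(scores, truths):
            if not true:
                continue
            n_annotated += 1
            predicted = {term for term, s in pred.items() if s >= t}
            hits = len(predicted & true)
            recall_sum += hits / len(true)
            if predicted:
                precisions.append(hits / len(predicted))
        if not precisions or n_annotated == 0:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n_annotated
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Two proteins with placeholder term names (not real HPO identifiers).
toy_scores = [{"T1": 0.9, "T2": 0.4}, {"T1": 0.2, "T3": 0.8}]
toy_truths = [{"T1"}, {"T2", "T3"}]
best_f, best_t = f_max(toy_scores, toy_truths)
```

Because precision ignores proteins without any prediction above the threshold while recall does not, raising the threshold can increase precision while recall falls, and $F_{\max}$ picks the threshold with the best trade-off.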

**Fig. 3**
Prediction performance of HPOseq with different embedding dimensions. a AUPR. b $F_{\max}$

**Fig. 4**
Performance of HPOseq in the ablation study. a With different modules. b With different similarity networks. c With different fusion methods
