BMC Bioinformatics. 2025 Apr 22;26(1):110.
doi: 10.1186/s12859-025-06122-3.

HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences

Kai Zhao et al. BMC Bioinformatics.

Abstract

Background: Understanding the relationships between proteins and specific disease phenotypes contributes to the early detection of diseases and advances the development of personalized medicine. The acquisition of a large amount of proteomics data has facilitated this process. To improve discovery efficiency and reduce the time and financial costs associated with biological experiments, various computational methods have yielded promising results. However, the lack of rich and reliable protein-related information still presents challenges in this process.

Results: In this paper, we propose an ensemble prediction model, named HPOseq, which predicts human protein-phenotype relationships from sequence information alone. HPOseq combines two base models. The first extracts intra-sequence information directly from amino acid sequences and uses it as protein features to predict the associated phenotypes. The second builds a protein-protein network based on sequence similarity and extracts inter-protein information for phenotype prediction. Finally, an ensemble module integrates the predictions of both base models to produce the final prediction.
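As a rough illustration of the final integration step, the sketch below blends two protein-by-phenotype score matrices. It is not the ensemble module described in the paper, which uses a fully connected network with a mask matrix. The function name `fuse_predictions`, the fixed `weight`, and the interpretation of the mask (1 where the network-based score is usable) are all assumptions made for this example.

```python
import numpy as np

def fuse_predictions(intra_scores, inter_scores, mask, weight=0.5):
    """Blend two score matrices of shape (proteins, phenotypes).

    Assumption: mask[i, j] == 1 where the network-based (inter-sequence)
    score may be used; elsewhere the sequence-based score is kept as-is.
    """
    intra = np.asarray(intra_scores, dtype=float)
    inter = np.asarray(inter_scores, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Weighted average of the two base-model outputs.
    blended = weight * intra + (1.0 - weight) * inter
    # Fall back to the intra-sequence score where the mask is 0.
    return mask * blended + (1.0 - mask) * intra

# Toy scores for 2 proteins and 2 phenotypes (hypothetical values).
intra = np.array([[0.8, 0.2], [0.4, 0.6]])
inter = np.array([[0.6, 0.4], [0.0, 0.0]])
mask = np.array([[1, 1], [0, 0]])  # second protein has no network score
fused = fuse_predictions(intra, inter, mask)
```

In this toy case the first protein receives the average of both models, while the second keeps its sequence-based scores. A learned fusion layer would replace the fixed `weight`.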

Conclusion: The results of 5-fold cross-validation show that HPOseq outperforms seven baseline methods for predicting protein-phenotype relationships. Moreover, we conduct case studies from the perspectives of phenotype annotation and protein analysis to verify the practical significance of HPOseq.
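For readers unfamiliar with the evaluation protocol, a minimal sketch of a 5-fold split over protein indices follows. The helper `k_fold_indices`, the shuffling, and the seed are assumptions for illustration; the abstract does not say how the folds were drawn.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once
    folds = np.array_split(order, k)     # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the held-out one forms the training set.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Ten hypothetical proteins split into five folds.
splits = list(k_fold_indices(10, k=5))
```

Each protein appears in exactly one test fold, so every protein is scored once by a model that did not see it during training.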

Keywords: Amino acid sequence; Deep learning; Ensemble model; Variational graph autoencoder.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Conflict of interest: The authors declare that they have no conflict of interest.

Figures

Fig. 1
Framework of HPOseq. (A) Prediction based on intra-sequence features: starting from the encoded amino acid sequence, intra-sequence features are extracted with a multi-scale convolutional layer, reduced in dimension through pooling and flattening layers, and passed to a fully connected layer that predicts intra-sequence disease-phenotype association scores. (B) Prediction based on inter-sequence features: sequence similarity among proteins is computed with the BLAST tool and used to construct an attribute graph; a variational graph autoencoder then extracts node feature representations, which are passed to a fully connected neural network that predicts inter-sequence disease-phenotype association scores. (C) Combination module: a fully connected neural network and a mask matrix fuse the predictions of the intra- and inter-sequence sub-models to generate more accurate disease-phenotype association scores.
Fig. 2
Prediction performance of HPOseq compared with that of the baseline methods. HPOseq outperforms the baseline models on two key metrics, AUPR and Fmax, reaching 0.3244 and 0.3869, respectively. HPOseq extracts features from two perspectives, the composition of the protein amino acid sequences and sequence similarity, to generate more comprehensive feature representations, which improves the prediction performance and robustness of the model.
Fig. 3
Prediction performance of HPOseq with different embedding dimensions. (a) AUPR. (b) Fmax.
Fig. 4
Performance of HPOseq in the ablation study. (a) With different modules. (b) With different similarity networks. (c) With different fusion methods.
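The Fmax metric reported in the figures can be sketched as follows, assuming the protein-centric definition commonly used in function-prediction benchmarks: the maximum over score thresholds of the harmonic mean of average precision and average recall. The function `fmax`, the threshold grid, and the toy data are assumptions; the paper's exact evaluation code may differ.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax for (proteins x phenotypes) label/score matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)  # assumed grid
    n_true = y_true.sum(axis=1)
    has_label = n_true > 0
    if not has_label.any():
        return 0.0
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & y_true).sum(axis=1)
        n_pred = pred.sum(axis=1)
        covered = n_pred > 0  # proteins with at least one prediction at t
        if not covered.any():
            continue
        # Precision averages over covered proteins only.
        precision = (tp[covered] / n_pred[covered]).mean()
        # Recall averages over all proteins that have annotations.
        recall = (tp[has_label] / n_true[has_label]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy data: 2 proteins, 3 phenotypes (hypothetical values).
labels = np.array([[1, 0, 1], [0, 1, 0]])
good_scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]])
perfect = fmax(labels, good_scores)
```

With these toy scores some threshold separates positives from negatives exactly, so the sketch returns the maximum possible value. Real phenotype prediction is far sparser, which is why reported values sit well below 1.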
