Cells. 2022 Aug 10;11(16):2485.
doi: 10.3390/cells11162485.

Integration of Human Protein Sequence and Protein-Protein Interaction Data by Graph Autoencoder to Identify Novel Protein-Abnormal Phenotype Associations


Yuan Liu et al. Cells. 2022.

Abstract

Understanding gene functions and their associated abnormal phenotypes is crucial for the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary for describing the phenotypic abnormalities associated with human diseases. However, current HPO annotations are far from complete, and only a small fraction of human protein-coding genes have HPO annotations. Thus, it is necessary to predict protein-phenotype associations using computational methods. Protein sequences can indicate the structure and function of proteins, and interacting proteins are more likely to share the same function. Integrating these features is therefore promising for predicting HPO annotations of human proteins. We developed GraphPheno, a semi-supervised method based on graph autoencoders that captures deep features from protein sequences without feature engineering, while also taking into account the topological properties of the protein-protein interaction network, to predict the relationships between human genes/proteins and abnormal phenotypes. Cross-validation and independent dataset tests show that GraphPheno has satisfactory prediction performance. The algorithm is further confirmed on automatic HPO annotation of no-knowledge proteins under the benchmark of the second Critical Assessment of Functional Annotation, 2013-2014 (CAFA2), where GraphPheno surpasses most existing methods. Further bioinformatics analysis shows that genes predicted by GraphPheno to be associated with a given phenotype share similar biological properties with the known ones. In a case study on the phenotype of abnormality of the mitochondrial respiratory chain, the top prioritized genes are validated by recent papers. We believe that GraphPheno will help to reveal more associations between genes and phenotypes and contribute to the discovery of drug targets.

Keywords: deep learning; graph autoencoder; protein-phenotype associations prediction.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Overview of GraphPheno. The model consists of three modules: (A) Protein embedding module: this module takes protein interaction and sequence information as input. The PPI network is converted to an adjacency matrix A. Protein amino acid sequences are embedded using the conjoint triad (CT) method and serve as the proteins' initial features X. (B) VGAE module for feature embedding: this module consists of a two-layer GCN encoder and a dot product decoder, and generates latent representations Z based on both the topological information from the PPI network and the protein sequence features. The adjacency matrix Ã is reconstructed from the latent variable Z through the dot product decoder. (C) Neural network module for prediction: the gold standard dataset is used to train this module, which takes the VGAE embedding as input and produces a prediction score for each gene-phenotype association.
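The encoder and decoder described in panel (B) can be illustrated with a minimal NumPy sketch of a single forward pass. This is not the authors' implementation: the function names, the log-variance parameterization, the ReLU activation, and the weight shapes are assumptions chosen for illustration, following the commonly used variational graph autoencoder formulation.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 commonly used by GCN layers."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def vgae_forward(A, X, W0, W_mu, W_logvar, rng):
    """One forward pass: two-layer GCN encoder, then dot product decoder.

    A: (n, n) PPI adjacency matrix; X: (n, f) initial protein features.
    W0, W_mu, W_logvar are hypothetical weight matrices (assumed shapes
    (f, h), (h, d), (h, d)); training is not shown.
    """
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)        # first GCN layer with ReLU
    mu = A_norm @ H @ W_mu                      # second layer: latent means
    logvar = A_norm @ H @ W_logvar              # second layer: latent log-variances
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(0.5 * logvar) * eps         # reparameterization trick
    A_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))    # dot product decoder with sigmoid
    return Z, A_rec
```

In this sketch the rows of Z play the role of the per-protein embeddings that a downstream classifier, such as the prediction module in panel (C), would consume.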
Figure 2
Performance evaluation of GraphPheno. (A,B) Performance comparison of various prediction models in term-centric evaluation (A) and protein-centric evaluation (B). Confidence intervals (95%) were determined using bootstrapping with 100 iterations. (C) Distribution of AUCs for the prediction of 3741 HPO terms by the GraphPheno model in five-fold cross-validation. (D) Distribution of AUCs for the prediction of 2993 HPO terms by the GraphPheno model in independent test set validation. The mean AUC values are plotted as dotted lines.
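The bootstrapped confidence intervals mentioned in this caption can be sketched as follows. The caption states only that 100 bootstrap iterations were used; the percentile method, the resampling unit, and the function names below are assumptions for illustration.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_iter=100, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (assumed procedure)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                          # keep resamples with both classes
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi
```

Labels are assumed to be 0 or 1. With only 100 iterations the interval endpoints are coarse, so a larger `n_iter` may be preferable when runtime allows.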
Figure 3
Performance comparison under the benchmark of the CAFA2 challenge using F-max. GraphPheno (green) was compared with the top-performing CAFA2 participating methods (light gray), the baseline methods in the CAFA2 challenge (red for Naive, and blue for BLAST), and several HPO prediction methods proposed after the CAFA2 challenge (dark gray). F-max is the maximum value of the F-measure over all thresholds. Confidence intervals (95%) were determined using bootstrapping with 100 iterations.
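As a rough illustration of F-max as defined in this caption, the sketch below computes a protein-centric F-measure at each threshold and keeps the maximum. The averaging convention (precision over proteins with at least one prediction at the threshold, recall over all benchmark proteins), the threshold grid, and the data layout are assumptions modeled on common CAFA-style evaluation, not details taken from this record.

```python
def f_max(predictions, truth, thresholds=None):
    """Maximum protein-centric F-measure over score thresholds.

    predictions: {protein: {term: score}} with scores in [0, 1]
    truth:       {protein: set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]   # assumed grid: 0.01 .. 1.00
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue                                 # skip proteins without annotations
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:                                # precision only where something is predicted
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(true_terms))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

The term identifiers used with this function can be any hashable labels, for example HPO term IDs.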
Figure 4
Predicted and known annotated genes of 4369 phenotypes share similar biological properties. Box plots compare predicted genes and random genes against the known annotated genes of each phenotype in terms of sequence consistency (A), number of protein–protein interactions (B), gene expression correlation coefficient (C), and number of proteins in the smallest shared biological process (D). The random genes were randomly selected from the unannotated genes of each phenotype, in a number equal to that of the predicted genes. In the box plots, the middle bar represents the median and the box represents the interquartile range; whiskers extend to 1.5× the interquartile range. p-values were calculated by Student's t-test and are shown above the boxes (*** p-value < 0.001, **** p-value < 0.0001).
Figure 5
Functional analysis of predicted and known annotated genes for the phenotype “Decreased activity of mitochondrial respiratory chain”. (A) Overlap between predicted and known annotated “Decreased activity of mitochondrial respiratory chain”-associated genes with respect to enriched GO terms and KEGG pathways. A relatively large number of shared enriched GO terms and KEGG pathways was found (*** p-value < 0.001, hypergeometric test). Predicted “Decreased activity of mitochondrial respiratory chain”-associated genes are enriched in mitochondria-related GO terms (B) and in mitochondria-related diseases such as Alzheimer’s disease and Parkinson’s disease (C).
Figure 6
Both predicted and known annotated “Decreased activity of mitochondrial respiratory chain”-associated genes tend to be significantly downregulated in multiple neurodegenerative diseases. The enrichment ratio was calculated as the GeneRatio divided by the Background Ratio. GeneRatio refers to the number of predicted or known annotated “Decreased activity of mitochondrial respiratory chain”-associated genes that are significantly downregulated in the GEO dataset divided by the total number of predicted or known annotated “Decreased activity of mitochondrial respiratory chain”-associated genes. Background Ratio refers to the number of significantly downregulated genes in the GEO dataset divided by the total number of genes identified in the GEO dataset. (ns: p-value > 0.05, *** p-value < 0.001, hypergeometric test).
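The enrichment ratio defined in this caption, together with a one-sided hypergeometric test, can be sketched as below. The variable names and the upper-tail form of the test are assumptions, and the sketch further assumes that every gene in the gene set was measured in the GEO dataset.

```python
from math import comb

def enrichment_ratio(k, n, K, N):
    """GeneRatio divided by Background Ratio.

    k: genes in the set that are significantly downregulated
    n: total genes in the set (predicted or known annotated)
    K: significantly downregulated genes in the dataset
    N: total genes identified in the dataset
    """
    gene_ratio = k / n
    background_ratio = K / N
    return gene_ratio / background_ratio

def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) when n genes are drawn without
    replacement from N genes, K of which are downregulated."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

For example, a set of 50 genes with 10 downregulated members, in a dataset where 100 of 1000 genes are downregulated, has a GeneRatio of 0.2, a Background Ratio of 0.1, and therefore an enrichment ratio of 2.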
Figure 7
Biological insight into the predicted “Decreased activity of mitochondrial respiratory chain”-associated genes. The oxidative phosphorylation (OXPHOS) system is embedded in the lipid bilayer of the inner mitochondrial membrane (IMM) and is composed of five protein enzyme complexes and two mobile electron carriers, namely ubiquinone (CoQ) and cytochrome c (Cyt C). The translocases of the outer and inner mitochondrial membranes (TOM and TIM, respectively) are also shown. Predicted “Decreased activity of mitochondrial respiratory chain”-associated genes are presented in dotted boxes, in which circles and rectangles denote genes that function as the subunits and assembly factors, respectively, of the mitochondrial respiratory chain complexes I–V, the TOM40 complex and the TIM23 complex. Predicted “Decreased activity of mitochondrial respiratory chain”-related genes validated by recent papers or by the newest version of the HPO database are marked with asterisks.
