Bioinformatics. 2018 Sep 1;34(17):i901-i907.

doi: 10.1093/bioinformatics/bty559.

Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes

Mona Alshahrani¹, Robert Hoehndorf¹

Affiliation

¹ Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

PMID: 30423077
PMCID: PMC6129260
DOI: 10.1093/bioinformatics/bty559

Mona Alshahrani et al. Bioinformatics. 2018.

Abstract

Motivation: In recent years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization. These methods commonly compute the similarity between a disease's (or patient's) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about the phenotypes associated with particular genes, which is highly incomplete in humans as well as in many model organisms such as the mouse.

Results: We developed SmuDGE, a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as between a disease and gene, or a disease and patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprising multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and that it significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network.

Availability and implementation: https://github.com/bio-ontology-research-group/SmuDGE.
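
As a rough illustration of how vector-based representations can be used for prioritization, the sketch below ranks candidate genes by the cosine similarity between a disease vector and each gene vector (cosine similarity is the measure named in the Fig. 3 caption). The function names `cosine` and `rank_genes`, the gene identifiers, and the 3-dimensional toy vectors are assumptions made up for this example; they are not taken from the SmuDGE code base.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)

def rank_genes(disease_vec, gene_vecs):
    """Return (gene, score) pairs sorted from most to least similar."""
    scored = [(gene, cosine(disease_vec, vec)) for gene, vec in gene_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only (hypothetical identifiers).
disease_vec = [1.0, 0.0, 1.0]
gene_vecs = {
    "geneA": [1.0, 0.0, 1.0],
    "geneB": [0.0, 1.0, 0.0],
    "geneC": [1.0, 1.0, 0.0],
}

ranking = rank_genes(disease_vec, gene_vecs)
for gene, score in ranking:
    print(gene, round(score, 3))
```

In practice the vectors would come from a learned embedding rather than being written by hand, and the ranked list would be compared against known gene–disease associations.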

Figures

**Fig. 1.**
Our knowledge graph consists of gene–phenotype associations (encoded using either HPO or MP), disease–phenotype associations (encoded using the HPO), interactions between genes (from the STRING database) and the PhenomeNET ontology

**Fig. 2.**
Overview of SmuDGE and its applications. (a) On the left side we show the graph of disease–phenotype and gene–phenotype associations together with the PhenomeNET ontology (top), and the same graph including interactions between genes used to generate E-Vecs at the bottom. We generate a corpus by graph traversal and then use a skipgram model to generate vectors for genes and gene products in the graph. These vectors can be used as input to a similarity measure, or a neural network, to predict interactions between genes and diseases. (b) Our ANN model is shown in the center; the input is the pair of disease and gene feature vectors of dimension x, the first hidden layer consists of 2x hidden units, and the second hidden layer consists of x hidden units; we use a dropout of 0.5 to mitigate the effects of overfitting. (c) We evaluate the model by predicting candidate genes for each disease and rank each gene for each disease based on the ANN’s prediction score

**Fig. 3.**
ROC curves for predicting gene–disease associations using cosine similarity between SmuDGE’s P-Vecs and comparison to Resnik’s semantic similarity measure

**Fig. 4.**
Comparison of ROC curves for predicting gene–disease associations based on mouse phenotypes using SmuDGE’s feature vectors and comparison to the Resnik and simGIC semantic similarity measures

**Fig. 5.**
ROC curves for predicting gene–disease associations for diseases with a single or multiple associated genes using SmuDGE’s P-Vec approach

**Fig. 6.**
ROC curves for predicting gene–disease associations for diseases with a single or multiple associated genes using SmuDGE’s E-Vec approach
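
The Fig. 2 caption describes generating a corpus by graph traversal before training a skipgram model. A minimal sketch of that first step is given below, assuming uniform random walks over an undirected view of the graph with edge labels kept as tokens; these choices, along with the node names, edge labels, and parameters (`walks_per_node`, `walk_length`, `seed`), are illustrative assumptions and not the authors' documented settings. The skipgram training step is not shown.

```python
import random

# Toy knowledge graph; all node and edge names are invented for illustration.
edges = [
    ("gene:G1", "has_phenotype", "pheno:P1"),
    ("disease:D1", "has_phenotype", "pheno:P1"),
    ("gene:G1", "interacts_with", "gene:G2"),
    ("pheno:P1", "subclass_of", "pheno:P0"),
]

def build_adjacency(edges):
    """Map each node to its (edge label, neighbor) pairs."""
    adj = {}
    for src, label, dst in edges:
        adj.setdefault(src, []).append((label, dst))
        # Assumption: edges are walked in both directions.
        adj.setdefault(dst, []).append((label, src))
    return adj

def random_walk(adj, start, length, rng):
    """Walk `length` steps from `start`, emitting node and edge-label tokens."""
    walk = [start]
    node = start
    for _ in range(length):
        neighbors = adj.get(node)
        if not neighbors:
            break
        label, node = rng.choice(neighbors)
        walk.extend([label, node])
    return walk

def build_corpus(edges, walks_per_node=5, walk_length=4, seed=0):
    """Generate a list of token sequences, several walks per node."""
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    corpus = []
    for node in sorted(adj):
        for _ in range(walks_per_node):
            corpus.append(random_walk(adj, node, walk_length, rng))
    return corpus

corpus = build_corpus(edges)
print(len(corpus), "walks; first walk:", corpus[0])
```

Each walk can then be treated as a sentence for a skipgram implementation, so that a gene without phenotype annotations (here the hypothetical `gene:G2`) still co-occurs with phenotype tokens through its interaction partner.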
