Bioinformatics. 2018 Sep 1;34(17):i901-i907.
doi: 10.1093/bioinformatics/bty559.

Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes

Mona Alshahrani et al. Bioinformatics.

Abstract

Motivation: In recent years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization. These methods commonly compute the similarity between a disease's (or patient's) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about the phenotypes associated with particular genes, which is highly incomplete in humans as well as in many model organisms such as the mouse.

Results: We developed SmuDGE, a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as between a disease and a gene, or a disease and a patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprising multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and that it significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network.

Availability and implementation: https://github.com/bio-ontology-research-group/SmuDGE.
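
The idea of reaching phenotypes indirectly through an interaction network can be illustrated with a small random-walk sketch. This is not the authors' implementation (see the repository above for that); the toy graph, the edge labels, and the function names `random_walk` and `build_corpus` are hypothetical and chosen only for illustration. The point is that a gene with no phenotype annotations of its own (`gene:B`) still ends up in walks that contain phenotype terms, because it interacts with an annotated gene. Walks like these could then serve as the "sentences" for a skip-gram model.

```python
import random

# Hypothetical toy knowledge graph: node -> list of (edge_label, neighbour).
# gene:B has no phenotype annotation; it only interacts with gene:A.
GRAPH = {
    "gene:A": [("has_phenotype", "HP:1"), ("interacts_with", "gene:B")],
    "gene:B": [("interacts_with", "gene:A")],
    "disease:D": [("has_phenotype", "HP:1")],
    "HP:1": [("subclass_of", "HP:0")],
    "HP:0": [],
}


def random_walk(graph, start, length, rng):
    """Walk up to `length` edges from `start`, recording edge labels and nodes."""
    walk = [start]
    node = start
    for _ in range(length):
        edges = graph.get(node, [])
        if not edges:  # dead end: stop the walk early
            break
        label, node = rng.choice(edges)
        walk.extend([label, node])
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Generate `walks_per_node` walks from every node (nodes in sorted order)."""
    rng = random.Random(seed)
    return [
        random_walk(graph, node, length, rng)
        for node in sorted(graph)
        for _ in range(walks_per_node)
    ]


if __name__ == "__main__":
    for walk in build_corpus(GRAPH, walks_per_node=2, length=4):
        print(" ".join(walk))
```

Including the edge labels in each walk is one possible design choice; it lets a downstream embedding model distinguish, for example, an interaction edge from a phenotype annotation.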

Figures

Fig. 1.
Our knowledge graph consists of gene–phenotype associations (encoded using either HPO or MP), disease–phenotype associations (encoded using the HPO), interactions between genes (from the STRING database) and the PhenomeNET ontology
Fig. 2.
Overview of SmuDGE and its applications. (a) On the left, we show the graph of disease–phenotype and gene–phenotype associations together with the PhenomeNET ontology (top), and the same graph including interactions between genes, used to generate E-Vecs (bottom). We generate a corpus by graph traversal and then use a skipgram model to generate vectors for genes and gene products in the graph. These vectors can be used as input to a similarity measure, or a neural network, to predict associations between genes and diseases. (b) Our ANN model is shown in the center; the input is the pair of disease and gene feature vectors of dimension x, the first hidden layer consists of 2x hidden units, and the second hidden layer consists of x hidden units; we use a dropout of 0.5 to mitigate the effects of overfitting. (c) We evaluate the model by predicting candidate genes for each disease and ranking each gene for each disease based on the ANN’s prediction score
Fig. 3.
ROC curves for predicting gene–disease associations using cosine similarity between SmuDGE’s P-Vecs, compared with Resnik’s semantic similarity measure
Fig. 4.
Comparison of ROC curves for predicting gene–disease associations based on mouse phenotypes using SmuDGE’s feature vectors and the Resnik and simGIC semantic similarity measures
Fig. 5.
ROC curves for predicting gene–disease associations for diseases with a single or multiple associated genes using SmuDGE’s P-Vec approach
Fig. 6.
ROC curves for predicting gene–disease associations for diseases with a single or multiple associated genes using SmuDGE’s E-Vec approach
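
The two scoring routes mentioned in the captions (cosine similarity between vectors, and an ANN over a disease–gene pair) can be sketched as follows. This is a minimal illustration, not the published model. Several details are assumptions not stated in the captions: that the disease and gene vectors are concatenated into a 2x-dimensional input, that hidden layers use ReLU and the output is a single sigmoid unit, that biases are omitted, and that weights are random rather than trained. Dropout is left out because it applies only during training. The helper names `cosine`, `rank_genes`, `init_mlp` and `score_pair` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_genes(disease_vec, gene_vecs):
    """Return gene ids sorted by descending cosine similarity to the disease."""
    scores = {gene: cosine(disease_vec, vec) for gene, vec in gene_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def init_mlp(x, rng):
    """Random weights for layers 2x -> 2x -> x -> 1 (assumed input is a concatenation)."""
    sizes = [2 * x, 2 * x, x, 1]
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def score_pair(weights, disease_vec, gene_vec):
    """Forward pass: ReLU on hidden layers, sigmoid on the output (both assumed)."""
    h = list(disease_vec) + list(gene_vec)
    for i, layer in enumerate(weights):
        z = [sum(w * a for w, a in zip(row, h)) for row in layer]
        if i == len(weights) - 1:
            h = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            h = [max(0.0, v) for v in z]
    return h[0]


if __name__ == "__main__":
    disease = [1.0, 0.0]
    genes = {"g1": [0.9, 0.1], "g2": [0.0, 1.0], "g3": [-1.0, 0.0]}
    print(rank_genes(disease, genes))
    weights = init_mlp(2, random.Random(0))
    print(score_pair(weights, disease, genes["g1"]))
```

Either score can be used to order candidate genes per disease, which is the ranking that the ROC curves evaluate.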
