[Preprint]. 2023 Jan 7:2023.01.05.522941.
doi: 10.1101/2023.01.05.522941.

Hetnet connectivity search provides rapid insights into how two biomedical entities are related

Daniel S Himmelstein et al. bioRxiv.

Abstract

Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes - including genes, diseases, drugs, pathways, and anatomical structures - with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only about how metformin is related to breast cancer, but also about how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open source implementation of these methods in our new Python package named hetmatpy.
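
The idea of chance expectation "based on node degree alone" can be illustrated with a degree-preserving edge shuffle, which rewires a network while keeping every node's degree fixed. The following is a minimal sketch under stated assumptions, not the hetmatpy implementation: the function name `permute_edges`, the toy compound-gene edge list, and the number of attempted swaps are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def permute_edges(edges, n_swaps, seed=0):
    """Shuffle (source, target) edges by repeated pairwise endpoint swaps.

    Each accepted swap turns (a, b), (c, d) into (a, d), (c, b), so every
    source and every target keeps its original degree.
    """
    rng = random.Random(seed)
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # Skip swaps that would create a duplicate edge (this also covers
        # the cases where the two edges share a source or a target).
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges


# Toy edges of a single type (hypothetical compound-gene pairs).
toy_edges = [("c1", "g1"), ("c1", "g2"), ("c2", "g2"), ("c2", "g3"), ("c3", "g1")]
permuted = permute_edges(toy_edges, n_swaps=100, seed=1)

# Degrees are unchanged even though individual edges may differ.
source_degrees_match = Counter(s for s, _ in permuted) == Counter(s for s, _ in toy_edges)
target_degrees_match = Counter(t for _, t in permuted) == Counter(t for _, t in toy_edges)
```

Connectivity measured on many such shuffled networks gives a baseline that depends only on degree, against which connectivity in the real network can be compared.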

Figures

Figure 1: A. Hetionet v1.0 metagraph.
The types of nodes and edges in Hetionet. B. Supervised machine learning approach from Project Rephetio. This figure visualizes the feature matrix used by Project Rephetio to make supervised predictions. Each row represents a compound–disease pair. The top half of the rows corresponds to known treatments (i.e., positives), while the bottom half corresponds to non-treatments (i.e., negatives under a closed-world assumption, not known to be treatments in PharmacotherapyDB). Here, an equal number of treatments and non-treatments are shown, but in reality the problem is heavily imbalanced. Project Rephetio scaled models to assume a positive prevalence of 0.36% [2,4]. Each column represents a metapath, labeled with its abbreviation. Feature values are DWPCs (transformed and standardized), which assess the connectivity along the specified metapath between the specific compound and disease. Green values indicate above-average connectivity, whereas blue values indicate below-average connectivity. In general, positives have greater connectivity for the selected metapaths than negatives. Rephetio used a logistic regression model to learn the effect of each type of connectivity (feature) on the likelihood that a compound treats a disease. The model predicts whether a compound–disease pair is a treatment based on its features, but requires supervision in the form of known treatments.
Figure 2: Expanded metapath details from the connectivity search webapp.
This is the expanded view of the metapath table in Figure 4B.
Figure 3: Homepage of the Hetio website.
The homepage provides a succinct overview of what Hetionet consists of and what its purpose is.
Figure 4: Using the connectivity search webapp to explore the pathophysiology of Alzheimer’s disease.
This figure shows an example user workflow for https://het.io/search/. A. The user selects two nodes. Here, the user is interested in Alzheimer’s disease and selects it as the source node. The user limits the target node search to metanodes relating to gene function. The target node search box suggests nodes, sorted by the number of significant metapaths. When the user types in the target node box, the matches reorder based on search word similarity. Here, the user becomes interested in how the circadian rhythm might relate to Alzheimer’s disease. B. The webapp returns metapaths between Alzheimer’s disease and the circadian rhythm pathway. The user unchecks “precomputed only” to compute results for all metapaths with length ≤ 3, not just those that surpass the database inclusion threshold. The user sorts by adjusted p-value and selects 7 of the top 10 metapaths. C. Paths for the selected metapaths are ordered by their path score. The user selects 8 paths (1 from a subsequent page of results) to show in the graph visualization and highlights a single path involving ARNT2 for emphasis. D. A subgraph displays the previously selected paths. The user improves on the automated layout by repositioning nodes. Clicking an edge displays its properties, informing the user that the association between Creutzfeldt-Jakob disease and NPAS2 was detected by GWAS.
Figure 5: Path-based metrics vary by node degree and network permutation status.
Each row shows a different metric of the DWPC distribution for the CbGpPWpG metapath (traversing Compound–binds–Gene–participates–Pathway–participates–Gene), selected for illustrative purposes. Metrics are computed for degree-groups, each of which is a specific pair of source degree (in this case, the source compound’s count of CbG edges) and target degree (in this case, the target gene’s count of GpPW edges). On the left, metrics are reported for the unpermuted hetnet and on the right for the 200 permuted hetnets. Hence, each cell on the right summarizes 200 times as many DWPCs as the corresponding cell on the left. The colormap is row normalized, such that its intensity peaks for the maximum value of each metric across the unpermuted and permuted values. Gray indicates null values.
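
The degree-group summaries described in this caption can be sketched as a simple grouping step. This is an illustrative sketch only: the function name `summarize_by_degree_group`, the record layout, the specific metrics, and the use of the population standard deviation are assumptions, since the caption does not list the metrics or how they are computed.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_degree_group(records):
    """Summarize DWPCs per (source_degree, target_degree) group.

    `records` is an iterable of (source_degree, target_degree, dwpc) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for source_degree, target_degree, dwpc in records:
        groups[(source_degree, target_degree)].append(dwpc)

    summary = {}
    for key, values in groups.items():
        nonzero = [v for v in values if v > 0]
        summary[key] = {
            "n": len(values),
            "n_nonzero": len(nonzero),
            # None marks metrics that are undefined when no DWPC is nonzero,
            # loosely analogous to the gray (null) cells in the figure.
            "nonzero_mean": mean(nonzero) if nonzero else None,
            "nonzero_sd": pstdev(nonzero) if nonzero else None,
        }
    return summary


# Toy DWPC values for two degree groups (made-up numbers).
toy_records = [(1, 2, 0.0), (1, 2, 2.0), (1, 2, 4.0), (3, 1, 0.0)]
toy_summary = summarize_by_degree_group(toy_records)
```

Pooling DWPCs from every permuted hetnet into the same groups is what makes each permuted cell summarize many more values than its unpermuted counterpart.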
Figure 6: From null distribution to p-value for DWPCs.
Null DWPC distributions are shown for three metapaths between Alzheimer’s disease and the circadian rhythm pathway, selected from Figure 2. For each metapath, null DWPCs are computed on 200 permuted hetnets and grouped according to source–target degree. Histograms show the null DWPCs for the degree group corresponding to Alzheimer’s disease and the circadian rhythm pathway (noted as deg. in the plot titles). The proportion of null DWPCs that were zero is calculated, forming the “hurdle” of the null distribution model. The nonzero null DWPCs are modeled using a gamma distribution, which can be fit solely from a sample mean and standard deviation. The mean of nonzero null DWPCs is denoted with a diamond, with the standard deviation plotted twice as a line in either direction. Actual DWPCs are compared to the gamma-hurdle null distribution to yield a p-value.
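
The gamma-hurdle calculation described in this caption can be sketched in a few lines. This is a sketch under stated assumptions rather than the hetmatpy code: the function names, the method-of-moments fit (shape = mean²/sd², scale = sd²/mean), and the handling of a zero DWPC are assumptions, and the incomplete gamma function is evaluated with a standard series and continued-fraction routine so the example needs only the standard library.

```python
import math


def gamma_sf(x, shape, scale):
    """P(X >= x) for a gamma distribution, using only the standard library."""
    if x <= 0:
        return 1.0
    a, z = shape, x / scale
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1:
        # Series for the regularized lower incomplete gamma function.
        term = total = 1.0 / a
        n = a
        for _ in range(10_000):
            n += 1
            term *= z / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz) for the upper incomplete gamma.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10_000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def gamma_hurdle_p_value(dwpc, n_total, n_nonzero, nonzero_mean, nonzero_sd):
    """P(null DWPC >= observed DWPC) under a gamma-hurdle null model."""
    if dwpc <= 0:
        return 1.0  # every null DWPC is at least zero
    if nonzero_mean <= 0 or nonzero_sd <= 0:
        raise ValueError("gamma fit needs a positive mean and standard deviation")
    shape = nonzero_mean**2 / nonzero_sd**2  # method-of-moments estimates
    scale = nonzero_sd**2 / nonzero_mean
    hurdle = n_nonzero / n_total  # probability that a null DWPC is nonzero
    return hurdle * gamma_sf(dwpc, shape, scale)


# Made-up null summary: half of the null DWPCs nonzero, mean 1.0, sd 1.0.
p_value = gamma_hurdle_p_value(2.0, n_total=200, n_nonzero=100,
                               nonzero_mean=1.0, nonzero_sd=1.0)
```

Because only the counts, mean, and standard deviation enter the calculation, a p-value can be computed from per-degree-group summary statistics without retaining every null DWPC.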
Figure 7: Schema for the connectivity search backend relational database models.
Each Django model is represented as a table, whose rows list the model’s field names and types. Each model corresponds to a database table. Arrows denote foreign key relationships. The arrow labels indicate the foreign key field name followed by reverse relation names generated by Django (in parentheses).

References

    1. Himmelstein Daniel, Greene Casey, Baranzini Sergio, Renaming ‘heterogeneous networks’ to a more concise and catchy term, ThinkLab (2015-08-16) https://doi.org/f3mn4v, DOI: 10.15363/thinklab.d104
    2. Himmelstein Daniel Scott, Lizee Antoine, Hessler Christine, Brueggeman Leo, Chen Sabrina L, Hadley Dexter, Green Ari, Khankhanian Pouya, Baranzini Sergio E, Systematic integration of biomedical knowledge prioritizes drugs for repurposing, eLife (2017-09-22) https://doi.org/cdfk, DOI: 10.7554/elife.26726
    3. Himmelstein Daniel, Announcing PharmacotherapyDB: the Open Catalog of Drug Therapies for Disease, ThinkLab (2016-03-15) https://doi.org/f3mqtv, DOI: 10.15363/thinklab.d182
    4. Himmelstein Daniel, Our hetnet edge prediction methodology: the modeling framework for Project Rephetio, ThinkLab (2016-05-04) https://doi.org/f3qbmj, DOI: 10.15363/thinklab.d210
    5. Liben-Nowell David, Kleinberg Jon, The link-prediction problem for social networks, Journal of the American Society for Information Science and Technology (2007) https://doi.org/c56765, DOI: 10.1002/asi.20591
