Gigascience. 2024 Jan 2;13:giae001.

doi: 10.1093/gigascience/giae001.

The probability of edge existence due to node degree: a baseline for network-based predictions

Michael Zietz¹ ² ³, Daniel S Himmelstein¹ ⁴, Kyle Kloster⁵ ⁶, Christopher Williams¹, Michael W Nagle⁷ ⁸ ⁹, Casey S Greene¹ ¹⁰ ¹¹

Affiliations

¹ Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA.
² Department of Physics & Astronomy, University of Pennsylvania, Philadelphia, PA 19104, USA.
³ Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA.
⁴ Related Sciences, Denver, CO 80202, USA.
⁵ Carbon, Inc., Redwood City, CA 94063, USA.
⁶ Department of Computer Science, North Carolina State University, Raleigh, NC 27606, USA.
⁷ Internal Medicine Research Unit, Pfizer Worldwide Research, Development, and Medical, Cambridge, MA 02139, USA.
⁸ Integrative Biology, Internal Medicine Research Unit, Worldwide Research, Development, and Medicine, Pfizer Inc., Cambridge, MA 02139, USA.
⁹ Human Biology Integration Foundation, Deep Human Biology Learning, Eisai Inc., Cambridge, MA 02140, USA.
¹⁰ Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA.
¹¹ Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA.

PMID: 38323677
PMCID: PMC10848215
DOI: 10.1093/gigascience/giae001

Michael Zietz et al. Gigascience. 2024.

Abstract

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

Keywords: Python; XSwap; bioinformatics; edge prediction; edge prior; heterogeneous; knowledge graphs; networks; node degree; permutation.
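The permutation idea in the abstract can be illustrated with a minimal sketch. It is not the API of the released xswap package: the function names `xswap_permute`, `edge_prior`, and `node_degrees`, the number of swaps per permutation, and the toy graph are all assumptions made for illustration. The sketch handles only an undirected simple graph. It repeatedly swaps the endpoints of two edges, rejects swaps that would create a self-loop or duplicate edge, and estimates an edge prior as the fraction of permuted networks in which a node pair is connected.

```python
import random
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def xswap_permute(edges, n_swaps, seed=0):
    """Degree-preserving permutation of an undirected simple graph.

    Picks two edges (a, b) and (c, d) and rewires them to (a, d) and
    (c, b). Swaps that would create a self-loop or a duplicate edge are
    skipped, so every node keeps its original degree.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Randomly orient the second edge so both rewirings are reachable.
        if rng.random() < 0.5:
            c, d = d, c
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


def edge_prior(edges, n_perm=200, seed=0):
    """Fraction of permuted networks in which each node pair has an edge.

    Permutations are chained (each starts from the previous one); the
    choice of 10 attempted swaps per edge is an illustrative assumption.
    Node pairs that never receive an edge are omitted (prior of 0).
    """
    counts = Counter()
    current = list(edges)
    for k in range(n_perm):
        current = xswap_permute(current, n_swaps=10 * len(current), seed=seed + k)
        counts.update(current)
    return {pair: n / n_perm for pair, n in counts.items()}


if __name__ == "__main__":
    toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
    for pair, p in sorted(edge_prior(toy_edges).items()):
        print(pair, round(p, 3))
```

Because the permuted networks keep every node's degree but scramble the specific connections, a prior estimated this way depends only on degree. A feature's performance on permuted networks then gives a baseline against which its performance on the real network can be compared.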

Conflict of interest statement

This work was supported, in part, by Pfizer Worldwide Research, Development, and Medical.

Figures

**Figure 1:**
Biomedical networks are characterized by nonuniform degree distributions. Eight degree distributions are plotted for 6 edge types from Hetionet v1.0 [5]. Hetionet integrates subnetworks for 24 different edge types, the degree distributions of which are analyzed separately. Furthermore, bipartite (e.g., Anatomy→expresses→Gene) and directed (e.g., Gene→regulates→Gene) graphs (Hetionet edge types) have both source and target degrees that must be assessed separately. Undirected edge types (e.g., Compound–resembles–Compound) have only a single degree distribution. Degree distributions are nonuniform and vary greatly between different networks. The y-axis is log₁₀-scaled to accommodate the common occurrence where most nodes have low degree while a small portion of nodes have high degree. Several distributions have nodes that reach the maximum degree, corresponding to a node being connected to all other possible nodes. Zero-degree nodes are not displayed, since methodological limitations often result in edge data only existing for a subset of nodes.

**Figure 2:**
XSwap algorithm pseudocode. (A) XSwap algorithm presented by Hanhijärvi et al. [15]. (B) Extension of the XSwap algorithm to other types of networks.

**Figure 3:**
Modified XSwap algorithm graphical explanation.

**Figure 4:**
The XSwap-derived edge prior can be analytically approximated. The analytical approximation is plotted against the XSwap-derived edge prior for 3 networks (edge types) from Hetionet. The strong correlation suggests that the approximation will be suitable for applications where computation time is a limiting factor.

**Figure 5:**
(A) Degree distributions of networks with and without degree bias can be very different. Data on PPI and TF-TG were split between literature-derived and systematically derived networks. In both cases, the networks exhibit large differences in degree distribution. Coauthorship relationship networks split by date of first coauthorship roughly share their degree distributions. (B) Comparison of individual node degrees between different networks. Not only are the overall degree distributions different, but individual nodes can have systematically different degrees between 2 networks. Uniform random sampling produces linearly correlated node degree, while nonrandom sampling produces noncorrelated degree. Systematically derived networks are not uniformly sampled from literature-derived networks or vice versa. Seventy percent of literature edges were sampled with uniform probability for the “Subsampled holdout” network.

**Figure 6:**
The edge prior accurately assigns the probability of edge existence. (A) Calibration curves for full network reconstruction of 20 networks from Hetionet. For every unique predictor value on the horizontal axis, the fraction of node pairs with that predictor value having an edge in the network is shown on the vertical axis. The permutation-based edge prior’s calibration was superior to the other 2 strategies based on degree. (B) Calibration curves for sampled network reconstruction. The edge prior shows superior calibration in the 20 Hetionet networks. (C) Individual Hetionet edge-type calibration estimated by the 2-component decomposition of the Brier score, in which lower scores indicate better calibration. The edge prior has excellent calibration in unsampled and sampled networks, and each considered method is sensitive to shifts in the degree distribution.

**Figure 7:**
Degree can predict edges within a given network but does not generalize to networks with different degree distributions. The edge prior is able to reconstruct the networks on which it was computed (task 1, “unsampled,” 20 different networks) with high performance. When computed on a sampled network, the edge prior can reconstruct the unsampled network with slightly lower performance (task 2, “sampled,” 20 different networks). However, when computed on a completely different network (having a different degree distribution) of the same type of data, the edge prior’s performance is greatly reduced (task 3, “separate,” 3 different networks). The performance reduction from computing predictors on sampled networks is real but far smaller compared to a new degree distribution. This indicates that while degree can be effective for network reconstruction, it is far less effective in predicting edges from a different degree distribution.

**Figure 8:**
Common edge prediction metrics correlate with node degree. Five common edge prediction features (Supplementary Table S2) are correlated with node degree on the STRING PPI network [24]. All 5 features show a positive relationship with degree, although the magnitude of this correlation is highly variable. The preferential attachment index is understandably perfectly correlated because it is equal to the product of source and target degree. Each panel indicates the Pearson correlation (“r”) between feature and degree in the lower right corner.

**Figure 9:**
Identifying the fraction of a metric’s performance resulting from degree alone. Network reconstruction performances by 5 edge prediction features. Dotted red line indicates performance of the edge prior. Each feature was computed on both the unpermuted and 100 permutations of the STRING PPI network.
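The calibration curve described in the Figure 6A caption (for every unique predictor value, the fraction of node pairs with that value that have an edge) can be sketched in a few lines. The function name `calibration_curve` and the toy scores below are assumptions for illustration, not the authors' code or data.

```python
from collections import defaultdict


def calibration_curve(scores, labels):
    """Map each unique predictor value to the observed edge fraction.

    scores: predictor value for each node pair (e.g., an edge prior).
    labels: 1 if the node pair has an edge in the network, else 0.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for score, label in zip(scores, labels):
        totals[score] += 1
        positives[score] += label
    return {score: positives[score] / totals[score] for score in sorted(totals)}


if __name__ == "__main__":
    # Hypothetical predictor values and edge labels for six node pairs.
    toy_scores = [0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
    toy_labels = [0, 1, 1, 1, 1, 0]
    print(calibration_curve(toy_scores, toy_labels))
```

A well-calibrated predictor gives a curve close to the diagonal: node pairs assigned a value of p have an edge roughly a fraction p of the time.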

Update of

The probability of edge existence due to node degree: a baseline for network-based predictions.
Zietz M, Himmelstein DS, Kloster K, Williams C, Nagle MW, Greene CS. bioRxiv [Preprint]. 2023 Jan 6:2023.01.05.522939. doi: 10.1101/2023.01.05.522939. Update in: Gigascience. 2024 Jan 2;13:giae001. doi: 10.1093/gigascience/giae001. PMID: 36711569.

References

1. Williams RJ. Biology, methodology or chance? The degree distributions of bipartite ecological networks. PLoS One. 2011;6:e17645. doi: 10.1371/journal.pone.0017645.
2. Kelly WP, Ingram PJ, Stumpf MPH. The degree distribution of networks: statistical model selection. Bacterial Mol Netw. 2011;804:245–62. doi: 10.1007/978-1-61779-361-5_13.
3. Broido AD, Clauset A. Scale-free networks are rare. Nat Commun. 2019;10:1017. doi: 10.1038/s41467-019-08746-5.
4. Barabási A-L, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–12. doi: 10.1126/science.286.5439.509.
5. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife. 2017;6:e26726. doi: 10.7554/elife.26726.
