Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2023 Jan 6:2023.01.05.522939.
doi: 10.1101/2023.01.05.522939.

The probability of edge existence due to node degree: a baseline for network-based predictions

Affiliations

The probability of edge existence due to node degree: a baseline for network-based predictions

Michael Zietz et al. bioRxiv. .

Update in

Abstract

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree's predictive performance diminishes when the networks used for training and testing-despite measuring the same biological relationships-were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

PubMed Disclaimer

Figures

Figure 1:
Figure 1:. Biomedical networks are characterized by non-uniform degree distributions.
Eight degree distributions are plotted for six edge types Hetionet v1.0 [5]. Hetionet integrates subnetworks for 24 different edge types, the degree distributions of which are analyzed separately. Furthermore, bipartite (e.g. Anatomy→expresses→Gene) and directed (e.g. Gene→regulates→Gene) graphs (Hetionet edge types) have both source and target degrees that must be assessed separately. Undirected edge types (e.g Compound–resembles–Compound) have only a single degree distribution. Degree distributions are non-uniform and vary greatly between different networks. The y-axis is log10-scaled to accommodate the common occurrence where most nodes have low degree while a small portion of nodes have high degree. Several distributions have nodes that reach the maximum degree, corresponding to a node being connected to all other possible nodes. Zero-degree nodes are not displayed, since methodological limitations often result in edge data only existing for a subset of nodes.
Figure 2:
Figure 2:. XSwap algorithm pseudocode.
A. XSwap algorithm presented by Hanhijärvi, et al. [15]. B. Extension of the XSwap algorithm to other types of networks.
Figure 3:
Figure 3:. The XSwap-derived edge prior can be analytically approximated.
The analytical approximation is plotted against the XSwap-derived edge prior for three networks (edge types) from Hetionet. The strong correlation suggests that the approximation will be suitable for applications where computation time is a limiting factor.
Figure 4:
Figure 4:
A. Degree distributions of networks with and without degree bias can be very different. Data on PPI and TF-TG were split between literature-derived and systematically-derived networks. In both cases, the networks exhibit large differences in degree distribution. Co-authorship relationship networks split by date of first co-authorship roughly share their degree distributions. B. Comparison of individual node degrees between different networks. Not only are the overall degree distributions different, but individual nodes can have systematically different degrees between two networks. Uniform random sampling produces linearly-correlated node degree, while non-random sampling produces non-correlated degree. Systematically-derived networks are not uniformly sampled from literature-derived networks or vice versa. 70% of literature edges were sampled with uniform probability for the “Subsampled holdout” network.
Figure 5:
Figure 5:. Degree can predict edges within a given network but does not generalize to networks with different degree distributions
The edge prior is able to reconstruct the networks on which it was computed (Task 1, “unsampled”, 20 different networks) with high performance. When computed on a sampled network, the edge prior can reconstruct the unsampled network with slightly lower performance (Task 2, “sampled”, 20 different networks). However, when computed on a completely different network (having a different degree distribution) of the same type of data, the edge prior’s performance is greatly reduced (Task 3, “separate”, 3 different networks). The performance reduction from computing predictors on sampled networks is real but far smaller compared to a new degree distribution. This indicates that while degree can be effective for network reconstruction, it is far less effective in predicting edges from a different degree distribution.
Figure 6:
Figure 6:. The edge prior accurately assigns the probability of edge existence.
A. Calibration curves for full network reconstruction of 20 networks from Hetionet. For every unique predictor value on the horizontal axis, the fraction of node pairs with that predictor value having an edge in the network is shown on the vertical axis. The permutation-based edge prior’s calibration was superior to the other two strategies based on degree. B. Calibration curves for sampled network reconstruction. The edge prior shows superior calibration in the 20 Hetionet networks. C. Individual Hetionet edge type calibration estimated by the two-component decomposition of the Brier score, in which lower scores indicate better calibration. The edge prior has excellent calibration in unsampled and sampled networks, and each considered method is sensitive to shifts in the degree distribution.
Figure 7:
Figure 7:. Common edge-prediction metrics correlate with node degree.
Five common edge-prediction features (Supplemental table 2) are correlated with node degree on the STRING PPI network [18]. All five features show a positive relationship with degree, though the magnitude of this correlation is highly variable. The preferential attachment index is understandably perfectly correlated because it is equal to the product of source and target degree. Each panel indicates the Pearson correlation (“r”) between feature and degree in the lower right corner.
Figure 8:
Figure 8:. Identifying the fraction of a metric’s performance resulting from degree alone.
Network reconstruction performances by five edge prediction features. Dotted red line indicates performance of the edge prior. Each feature was computed on both the unpermuted and 100 permutations of the STRING PPI network.

References

    1. Williams Richard J, Biology, Methodology or Chance? The Degree Distributions of Bipartite Ecological Networks, PLoS ONE (2011-03-03) https://doi.org/fmtk6x, DOI: 10.1371/journal.pone.0017645 - DOI - PMC - PubMed
    1. Kelly William P, Ingram Piers J, Stumpf Michael PH, The Degree Distribution of Networks: Statistical Model Selection, Bacterial Molecular Networks (2011-10-28) https://doi.org/ddx5rx, DOI: 10.1007/978-1-61779-361-5_13 - DOI - PubMed
    1. Broido Anna D, Clauset Aaron, Scale-free networks are rare, Nature Communications (2019-03-04) https://doi.org/gfztz9, DOI: 10.1038/s41467-019-08746-5 - DOI - PMC - PubMed
    1. Barabási Albert-László, Albert Réka, Emergence of Scaling in Random Networks, Science (1999-10-15) https://doi.org/ccsmnz, DOI: 10.1126/science.286.5439.509 - DOI - PubMed
    1. Himmelstein Daniel Scott, Lizee Antoine, Hessler Christine, Brueggeman Leo, Chen Sabrina L, Hadley Dexter, Green Ari, Khankhanian Pouya, Baranzini Sergio E, Systematic integration of biomedical knowledge prioritizes drugs for repurposing, eLife (2017-09-22) https://doi.org/cdfk, DOI: 10.7554/elife.26726 - DOI - PMC - PubMed

Publication types

LinkOut - more resources