Bioinformatics. 2018 Jan 1;34(1):64-71.
doi: 10.1093/bioinformatics/btx419.

A nonparametric significance test for sampled networks

Andrew Elliott et al. Bioinformatics. 2018.

Abstract

Motivation: Our work is motivated by an interest in constructing a protein-protein interaction network that captures key features associated with Parkinson's disease. While there is an abundance of subnetwork construction methods available, it is often far from obvious which subnetwork is the most suitable starting point for further investigation.

Results: We provide a method to assess whether a subnetwork constructed from a seed list (a list of nodes known to be important in the area of interest) differs significantly from a randomly generated subnetwork. The proposed method uses a Monte Carlo approach. As different seed lists can give rise to the same subnetwork, we control for redundancy by constructing a minimal seed list as the starting point for the significance test. The null model is based on random seed lists of the same length as a minimum seed list that generates the subnetwork; in this random seed list the nodes have (approximately) the same degree distribution as the nodes in the minimum seed list. We use this null model to select subnetworks which deviate significantly from random on an appropriate set of statistics and might capture useful information for a real world protein-protein interaction network.
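The Monte Carlo test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the graph is a plain adjacency dictionary rather than a NetworkX object, the random seed lists match the observed seed degrees exactly (the abstract only says "approximately"), the subnetwork is built by 2-hop snowball sampling, edge density stands in for the paper's statistics, and the minimal-seed-list step is omitted. All function names and the toy graph are hypothetical.

```python
import random
from collections import deque


def snowball_nodes(adjacency, seeds, hops=2):
    """Nodes within `hops` steps of any seed (multi-source breadth-first search)."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] < hops:
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
    return set(distance)


def edge_density(adjacency, nodes):
    """Fraction of possible edges present among `nodes` (illustrative statistic)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for u in nodes for v in adjacency[u] if v in nodes) / 2
    return edges / (n * (n - 1) / 2)


def degree_matched_seeds(adjacency, seeds, rng):
    """Assumed null model: one distinct random node of equal degree per seed."""
    by_degree = {}
    for node, neighbours in adjacency.items():
        by_degree.setdefault(len(neighbours), []).append(node)
    chosen = []
    for seed in seeds:  # seeds are assumed to be distinct
        pool = [n for n in by_degree[len(adjacency[seed])] if n not in chosen]
        chosen.append(rng.choice(pool))
    return chosen


def monte_carlo_p_values(adjacency, seeds, statistic, n_draws=500, hops=2, rng_seed=0):
    """Return (lower-tail, upper-tail) Monte Carlo P-values for the observed sample."""
    rng = random.Random(rng_seed)
    observed = statistic(adjacency, snowball_nodes(adjacency, seeds, hops))
    lower = upper = 1  # the observed value counts as one realization
    for _ in range(n_draws):
        random_seeds = degree_matched_seeds(adjacency, seeds, rng)
        value = statistic(adjacency, snowball_nodes(adjacency, random_seeds, hops))
        lower += value <= observed
        upper += value >= observed
    return lower / (n_draws + 1), upper / (n_draws + 1)


def toy_graph(n=60, extra=40, rng_seed=1):
    """Hypothetical test graph: a ring of `n` nodes plus `extra` random chords."""
    rng = random.Random(rng_seed)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        adjacency[i].add(j)
        adjacency[j].add(i)
    for _ in range(extra):
        u, v = rng.sample(range(n), 2)
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


graph = toy_graph()
print(monte_carlo_p_values(graph, [0, 10, 20], edge_density, n_draws=200))
```

Returning both tails keeps the two-tailed comparison explicit: the smaller of the two values would be compared with the corrected significance level.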

Availability and implementation: The software used in this paper is available for download at https://sites.google.com/site/elliottande/. The software is written in Python and uses the NetworkX library.

Contact: ande.elliott@gmail.com or felix.reed-tsochas@sbs.ox.ac.uk.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(A) 2-hop Snowball Sampling Example. The seed list consists of node 1 (circle) only. The shapes of the other nodes represent their distances from the seed node: squares represent nodes 1 hop from the seed, diamonds 2 hops from the seed and triangles 3 hops from the seed. Dashed edges represent cross-edges in a 2-hop snowball sample. (B)-(D) demonstrate sampling techniques based on paths. The network in (C) represents the unsampled network. (B) and (D) show the network in (C) sampled with the ‘All Shortest Paths’ (B) and Path2 (D) methods, respectively. Seed nodes are represented by circles and other nodes are represented by squares. (Color version of this figure is available at Bioinformatics online.)
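As a rough illustration of the k-hop snowball sampling shown in panel (A), the sketch below runs a breadth-first search from the seed list and keeps every node within the hop limit, along with its distance to the nearest seed. The adjacency-dictionary representation, the function names and the five-node toy graph are assumptions made for this example (the toy graph is not the network in the figure), and keeping every edge between sampled nodes is likewise an assumption.

```python
from collections import deque


def snowball_sample(adjacency, seeds, hops=2):
    """Map each node within `hops` of a seed to its distance from the nearest seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def induced_edges(adjacency, nodes):
    """Edges of the full graph whose two endpoints are both in the sample."""
    return {frozenset((u, v)) for u in nodes for v in adjacency[u] if v in nodes}


# Hypothetical toy graph: node 5 lies 3 hops from the seed, so it is left out.
toy = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
sample = snowball_sample(toy, seeds=[1], hops=2)
print(sample)                            # {1: 0, 2: 1, 3: 1, 4: 2}
print(len(induced_edges(toy, sample)))   # 4
```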
Fig. 2.
A scatter diagram of accuracy versus purity of benchmark networks in which the sample is significant under our test (significance level 0.025/4 due to a two-tailed adjustment and a Bonferroni correction), where colour represents the construction method used. An ideal method would have accuracy = 1 and purity = 1.
Fig. 3.
Test results for different seed lists: smallest P-value, on a negative log scale. Results are shown for the Expression seed list (first panel); OMIM seed list (middle panel); and a breakdown of the P-value for the 4 statistics evaluated for the Path2 Expression network (final panel). Blue (left bar): original seed list; yellow (right bar): minimum seed list; red (horizontal line): significance level (0.025/4). Note that, owing to the negative log scale on the y axis, values above the red line are significant. Each P-value is computed using 15 000 Monte Carlo realizations. (Color version of this figure is available at Bioinformatics online.)
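The threshold arithmetic behind the red line can be sketched as follows. The caption states only the corrected level 0.025/4; reading it as an overall level of 0.05 halved for two tails and divided by 4 statistics is an assumption, as are the base-10 logarithm, the function name and the example P-values.

```python
import math

OVERALL_LEVEL = 0.05   # assumed overall level; the caption gives only 0.025/4
N_STATISTICS = 4       # number of statistics in the Bonferroni correction

# Two-tailed adjustment (divide by 2), then Bonferroni (divide by 4).
threshold = OVERALL_LEVEL / 2 / N_STATISTICS


def smallest_p_summary(p_values, level=threshold):
    """Return (is_significant, -log10 of the smallest P-value)."""
    smallest = min(p_values)
    return smallest < level, -math.log10(smallest)


# Hypothetical P-values for four statistics.
significant, height = smallest_p_summary([0.2, 0.001, 0.5, 0.03])
print(threshold, -math.log10(threshold), significant, height)
```

On a negative log scale a smaller P-value gives a taller bar, which is why bars above the line correspond to significant results.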
Fig. 4.
Histogram (bin size of 20) of the differences in P-values produced by adding redundant seed nodes to 100 2-hop snowball samples of the BioGRID PPI network, each with 25 initial random seed proteins. Each P-value is computed using 2000 Monte Carlo realizations. (Color version of this figure is available at Bioinformatics online.)
Fig. 5.
Distribution of P-value results for 2-hop snowball sampling under our null model and the configuration model. 1000 networks are generated by selecting 25 random seeds, and the assortativity and average local clustering coefficient of each network are calculated. For each network we calculate the P-value with respect to a random network (under our null model and the configuration model). (Color version of this figure is available at Bioinformatics online.)
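The two statistics named in this caption can be computed along the following lines. This is a sketch under assumptions: "assortativity" is read as degree assortativity (the Pearson correlation of the degrees at the two ends of each edge), nodes of degree below 2 are given a clustering coefficient of 0, and graphs are adjacency dictionaries of sets with at least one edge. The function names are hypothetical; the paper's software uses NetworkX, which offers comparable routines.

```python
import math


def average_local_clustering(adjacency):
    """Mean over all nodes of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue  # assumed convention: such nodes contribute 0
        links = sum(1 for u in neighbours for v in adjacency[u] if v in neighbours) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(adjacency)


def degree_assortativity(adjacency):
    """Pearson correlation of end-point degrees, each edge counted in both directions."""
    pairs = [(len(adjacency[u]), len(adjacency[v])) for u in adjacency for v in adjacency[u]]
    mean = sum(x for x, _ in pairs) / len(pairs)  # same mean for both ends by symmetry
    covariance = sum((x - mean) * (y - mean) for x, y in pairs)
    variance = sum((x - mean) ** 2 for x, _ in pairs)
    return covariance / variance if variance > 0 else math.nan


# Hypothetical example: a triangle (0, 1, 2) with a pendant node 3 attached to node 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(average_local_clustering(example), degree_assortativity(example))
```

The assortativity is undefined when every node has the same degree, so the sketch returns NaN in that case rather than dividing by zero.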
