BMC Syst Biol. 2012 Jun 12;6:63.
doi: 10.1186/1752-0509-6-63.

Network methods for describing sample relationships in genomic datasets: application to Huntington's disease

Michael C Oldham et al. BMC Syst Biol. 2012.

Abstract

Background: Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of biological research. While many studies have sought to characterize relationships among genomic features, commensurate efforts to characterize relationships among biological samples have been less common. Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating downstream analytical tasks such as gene co-expression network analysis.

Results: Here we demonstrate the use of network methods for characterizing sample relationships in microarray data generated from human brain tissue. We describe an approach for identifying outlying samples that does not depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the consistency and integrity of sample relationships, which can be compared across disparate studies, technology platforms, and biological systems. Among these measures, we provide evidence that the correlation between the connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of homogeneity among biological samples. We also show that this measure, which we refer to as cor(K,C), can distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor(K,C) reveals the profound effect of Huntington's disease on samples from the caudate nucleus relative to other brain regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of genes.

Conclusions: These results underscore the importance of systematically exploring sample relationships in large genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this purpose using freely available R software that has been designed to enable iterative and interactive exploration of sample networks.

Figures

Figure 1
Network concepts provide a natural framework for describing relationships among samples in high-dimensional biological datasets. A motivational example. (A) Dendrogram produced by average linkage hierarchical clustering using 1 – ISA (intersample adjacency) for a subset of samples (prefrontal cortex [BA9] of CTRL subjects) from ref. [14]. (B) Dendrogram produced by average linkage hierarchical clustering using 1 – ISA for another subset of samples (cerebellum [CB] of CTRL subjects) from ref. [14]. (C) Standardized sample connectivities (Z.K) provide a different view of the BA9 CTRL samples. BA9_91_C (red) exhibited significantly lower connectivity than the other samples in this group, consistent with the dendrogram (A). (D) Standardized sample connectivities for the CB CTRL samples. Three samples (CB_80_C, CB_H123_C, and CB_67_C, in red) had Z.K values that were significantly lower than the others. Note that CB_67_C had much lower connectivity than CB_H110_C (blue), yet these two samples were indistinguishable in the dendrogram above (B). Black horizontal lines in (C) and (D) correspond to an optional Z.K threshold (here −2) for outlier removal; CTRL = control.
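The outlier rule in panels (C) and (D) can be illustrated with a short sketch. It is written in Python rather than the authors' R software, and it rests on assumptions not spelled out in this caption: that the intersample adjacency is the signed measure ((1 + r) / 2)^β with β = 2, that a sample's connectivity is the sum of its adjacencies to all other samples, and that Z.K is the z-score of connectivity. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def intersample_adjacency(samples, beta=2):
    """Assumed signed adjacency ((1 + r) / 2) ** beta for every pair of samples."""
    n = len(samples)
    return [
        [1.0 if i == j else ((1 + pearson(samples[i], samples[j])) / 2) ** beta
         for j in range(n)]
        for i in range(n)
    ]


def standardized_connectivity(adj):
    """Z.K: z-score of each sample's summed adjacency to all other samples."""
    n = len(adj)
    k = [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]
    mu, sd = mean(k), stdev(k)
    return [(v - mu) / sd for v in k]


def flag_outliers(z_k, threshold=-2.0):
    """Indices of samples whose Z.K falls below the (optional) threshold."""
    return [i for i, z in enumerate(z_k) if z < threshold]


# Toy data: nine samples follow a shared profile, the tenth is reversed.
base = [1, 2, 3, 4, 5, 6, 7, 8]
samples = [[b + 0.1 * ((i * j) % 3) for j, b in enumerate(base)] for i in range(9)]
samples.append(list(reversed(base)))

adj = intersample_adjacency(samples)
z_k = standardized_connectivity(adj)
outliers = flag_outliers(z_k)
print(outliers)
```

In this toy dataset only the reversed profile (index 9) falls below the −2 threshold. Because the rule operates on connectivity alone, it does not depend on any clustering step.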
Figure 2
Sample network concepts reveal the profound effect of Huntington’s disease in caudate nucleus. Comparison of standardized sample connectivities (Z.K) and standardized clustering coefficients (Z.C) between control subjects (CTRL) and subjects with Huntington’s disease (HD) in prefrontal cortex (A; n = 9 CTRL and 16 HD), motor cortex (B; n = 16 CTRL and 14 HD), cerebellum (C; n = 23 CTRL and 34 HD), and caudate nucleus (D; n = 31 CTRL and 35 HD). Networks were constructed over all probe sets (n = 18,631) using all samples (CTRL and HD) from each brain region.
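As a rough illustration of the second quantity plotted here, the sketch below computes a clustering coefficient for each node of a weighted network and then standardizes it. The formula is one common weighted generalization from the co-expression network literature; treating it as the definition behind Z.C is an assumption, as are the function names and the small example matrices.

```python
from statistics import mean, stdev


def clustering_coefficients(adj):
    """Weighted clustering coefficient of each node.

    For node i: sum over ordered pairs (j, k) of a_ij * a_jk * a_ki, divided by
    (sum_j a_ij) ** 2 - sum_j a_ij ** 2, with i, j, k all distinct.
    """
    n = len(adj)
    coeffs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        numerator = sum(
            adj[i][j] * adj[j][k] * adj[k][i]
            for j in others for k in others if k != j
        )
        total = sum(adj[i][j] for j in others)
        denominator = total ** 2 - sum(adj[i][j] ** 2 for j in others)
        # Convention chosen here: a node with fewer than two neighbours gets 0.
        coeffs.append(numerator / denominator if denominator > 0 else 0.0)
    return coeffs


def standardize(values):
    """Z-scores (sample standard deviation), as assumed for Z.K and Z.C."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


# Uniform weighted network: every pair of nodes has adjacency 0.5.
uniform = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
c_uniform = clustering_coefficients(uniform)

# Unweighted check: nodes 0, 1, 2 form a triangle; node 3 attaches to node 0 only.
binary = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
c_binary = clustering_coefficients(binary)
z_c = standardize(c_binary)
```

For 0/1 adjacencies the formula reduces to the familiar fraction of connected neighbour pairs, which gives a quick sanity check on the weighted version.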
Figure 3
cor(K,C) depends upon network topology and network size. The Spearman correlation (cor(KC); z-axis) between the connectivity and the clustering coefficient as a function of network density (mean node adj. [adjacency]; x-axis) and network size (nodes; y-axis). Signed networks (β = 2) were simulated using the simulateModule function from the WGCNA R package [34]. The seed module eigengene (ME) consisted of 5,000 random, normally distributed features (mean = 0, sd = 1). The function parameters “corPower” and “propNegativeCor” were set to 0.75 and 0, respectively. The function parameter “minCor” was iteratively reduced from .95 to .05 by increments of .05, progressively degrading the strength of node connections; for each iteration, cor(KC) was calculated for module networks of various sizes (n = 10 to 100, by = 10).
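The statistic on the z-axis is a Spearman correlation, which can be sketched as a Pearson correlation computed on ranks. The example below is a minimal Python illustration, not the WGCNA simulation described in the caption; the function names and the connectivity and clustering values are made up for demonstration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cor_kc(connectivity, clustering):
    """Spearman correlation between connectivity (K) and clustering coefficient (C)."""
    return pearson(average_ranks(connectivity), average_ranks(clustering))


# Hypothetical per-sample values for a five-sample network.
k = [7.9, 7.5, 8.2, 6.1, 7.0]
c_same_order = [0.81, 0.78, 0.84, 0.60, 0.72]
c_reversed = [0.60, 0.72, 0.55, 0.90, 0.80]

rho_same = cor_kc(k, c_same_order)
rho_reversed = cor_kc(k, c_reversed)
```

Because only ranks enter the calculation, the statistic is unchanged by any monotone rescaling of K or C.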
Figure 4
Huntington’s disease exerts strong effects on specific gene co-expression modules in human caudate nucleus. Analysis of human caudate nucleus (CN) sample network properties for each of 23 gene co-expression modules previously identified in CN; colors correspond to the original gene co-expression module labels from [21]. (A) For each module sample network, the Spearman correlations cor(KC) are plotted for control (CTRL) and Huntington’s disease (HD) subjects. Each point corresponds to a module. Black line: y = x. (B) The log-transformed P–value of the difference between cor(KC) for CTRL and HD subjects is reported for each module (Methods). (C) The extent of differential expression (DE) between CTRL and HD was assessed for each module by using Student’s t-test of DE for the module eigengene (ME; i.e. the first principal component obtained by singular value decomposition of the module expression matrix) between CTRL and HD. (D) Comparison of the module significance levels reported in (B) and (C); linear least squares regression line in black. p.Diff.cor(K,C) denotes the P-value for testing the differences of cor(KC) between the CTRL and HD module sample networks. (B–D) Blue lines: P = .05; red lines: Bonferroni correction for multiple comparisons.
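The two reference lines in panels (B) to (D) can be worked out directly from the number of modules. The sketch below assumes that the Bonferroni threshold is 0.05 divided by the 23 modules tested and that the log transformation is −log10; neither detail is stated in this caption, and the names are illustrative.

```python
import math

N_MODULES = 23   # modules analysed in caudate nucleus, per the caption
ALPHA = 0.05     # nominal significance level (blue lines)

# Assumed Bonferroni threshold (red lines): alpha divided by the number of tests.
bonferroni_threshold = ALPHA / N_MODULES


def neg_log10(p):
    """Assumed log transformation of a P-value for plotting."""
    return -math.log10(p)


def passes_bonferroni(p, alpha=ALPHA, n_tests=N_MODULES):
    """True when a P-value survives the assumed Bonferroni correction."""
    return p < alpha / n_tests


nominal_line = neg_log10(ALPHA)
corrected_line = neg_log10(bonferroni_threshold)
print(round(bonferroni_threshold, 5), round(nominal_line, 2), round(corrected_line, 2))
```

Under these assumptions a module must reach roughly P < 0.0022 to clear the red line, compared with P < 0.05 for the blue line.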
Figure 5
cor(K,C) distinguishes sample subgroups in the absence of differential expression. Analysis of a simulated gene expression module consisting of 500 genes and 100 samples. Samples were assigned to one of three subgroups based on simulated disease status: “control” (n = 50; darkgreen), “moderate” (n = 25; red), or “severe” (n = 25; turquoise) (Methods). (A) Average linkage hierarchical clustering of samples using 1 – ISA (intersample adjacency) as a dissimilarity measure. (B) Distributions of module eigengene (ME) values by sample subgroup. Note that these distributions are not significantly different (P = 0.18, Kruskal-Wallis test), indicating that there is no differential expression associated with disease status at the modular level. (C) When depicted in terms of Z.K and Z.C, control and affected subjects segregated into two distinct groups (linear least squares regression lines in black [control] and red [affected]). (D) Heat map of simulated gene expression levels. Rows correspond to genes and columns correspond to samples. Green = low expression; red = high expression.
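The three-group comparison in panel (B) can be sketched as follows. This is a minimal Python version of the Kruskal-Wallis test for exactly three groups, using the chi-square approximation with 2 degrees of freedom, for which the upper-tail probability is exp(−H / 2). The function name and the example values are hypothetical, and the simulated data from the figure are not reproduced here.

```python
import math


def kruskal_wallis_three_groups(a, b, c):
    """Return (H, P) for three groups, with tie correction.

    P uses the chi-square approximation with 2 degrees of freedom.
    Not defined when every pooled value is identical.
    """
    groups = [a, b, c]
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Average rank for each distinct value, plus the tie-correction term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    rank_sums_sq = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums_sq - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)
    p = math.exp(-h / 2)
    return h, p


# Hypothetical eigengene-like values for three fully separated groups.
h_stat, p_value = kruskal_wallis_three_groups([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h_stat, 2), round(p_value, 4))
```

A large P-value from such a test, as reported in the caption, indicates that the subgroup distributions cannot be told apart by rank, which is what motivates looking at Z.K and Z.C instead.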
Figure 6
Caudate nucleus samples exhibit significant segregation by diagnosis in gene co-expression module M8C (salmon). Analysis of caudate nucleus (CN) sample network properties for genes comprising the CN salmon co-expression module M8C [21]. (A) Average linkage hierarchical clustering of samples using 1 – ISA (intersample adjacency) as a dissimilarity measure. Colors denote control (CTRL) subjects (darkgreen; n = 31) and Huntington’s disease (HD) subjects with varying grades of disease severity: HD grade 0 (black; n = 2), HD grade 1 (red; n = 11), HD grade 2 (turquoise; n = 16), HD grade 3 (blue; n = 5), and HD grade 4 (brown; n = 1). Standardized sample connectivities (Z.K; B) and standardized sample clustering coefficients (Z.C; C). (D) HD and CTRL samples segregated into two distinct groups when depicted in terms of Z.K and Z.C (linear least squares regression line in black [CTRL] and red [HD]). (E) Multivariate linear regression revealed a highly significant effect of diagnosis (Dx) on the salmon module eigengene. Blue line: P = .05; red line: Bonferroni correction for multiple comparisons. (F) Heat map of expression levels for genes comprising the salmon co-expression module M8C. Rows correspond to probe sets (genes) and columns correspond to samples. Green = low expression; red = high expression. Samples in (B–D, F) are colored as in (A).

References

    1. Nugent R, Meila M. An overview of clustering applied to molecular biology. Methods Mol Biol. 2010;620:369–404. doi: 10.1007/978-1-60761-580-4_12.
    2. Carugo O. Clustering criteria and algorithms. Methods Mol Biol. 2010;609:175–196. doi: 10.1007/978-1-60327-241-4_11.
    3. Carugo O. Proximity measures for cluster analysis. Methods Mol Biol. 2010;609:163–174. doi: 10.1007/978-1-60327-241-4_10.
    4. Frades I, Matthiesen R. Overview on techniques in cluster analysis. Methods Mol Biol. 2010;593:81–107. doi: 10.1007/978-1-60327-194-3_5.
    5. Kerr G, Ruskin HJ, Crane M, Doolan P. Techniques for clustering gene expression data. Comput Biol Med. 2008;38(3):283–293. doi: 10.1016/j.compbiomed.2007.11.001.
