Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Nov 15;10(1):106.
doi: 10.1186/s12918-016-0349-1.

DGCA: A comprehensive R package for Differential Gene Correlation Analysis

Affiliations

DGCA: A comprehensive R package for Differential Gene Correlation Analysis

Andrew T McKenzie et al. BMC Syst Biol. .

Abstract

Background: Dissecting the regulatory relationships between genes is a critical step towards building accurate predictive models of biological systems. A powerful approach towards this end is to systematically study the differences in correlation between gene pairs in more than one distinct condition.

Results: In this study we develop an R package, DGCA (for Differential Gene Correlation Analysis), which offers a suite of tools for computing and analyzing differential correlations between gene pairs across multiple conditions. To minimize parametric assumptions, DGCA computes empirical p-values via permutation testing. To understand differential correlations at a systems level, DGCA performs higher-order analyses such as measuring the average difference in correlation and multiscale clustering analysis of differential correlation networks. Through a simulation study, we show that the straightforward z-score based method that DGCA employs significantly outperforms the existing alternative methods for calculating differential correlation. Application of DGCA to the TCGA RNA-seq data in breast cancer not only identifies key changes in the regulatory relationships between TP53 and PTEN and their target genes in the presence of inactivating mutations, but also reveals an immune-related differential correlation module that is specific to triple negative breast cancer (TNBC).

Conclusions: DGCA is an R package for systematically assessing the difference in gene-gene regulatory relationships under different conditions. This user-friendly, effective, and comprehensive software tool will greatly facilitate the application of differential correlation analysis in many biological studies and thus will help identification of novel signaling pathways, biomarkers, and targets in complex biological systems and diseases.

Keywords: Breast cancer; Differential coexpression; Differential correlation; Multiscale clustering analysis; R package; RNA-Seq; TP53; Triple negative breast cancer.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
An example demonstrating the theoretical difference between differential expression and differential correlation. The top panel shows the RNA expression levels for two example genes in two example conditions. The bottom left panel shows that both of these genes have decreased expression values from condition A to condition B. On the other hand, the bottom right panel shows that these genes have positive correlation in condition A but no correlation in condition B, which could not have been predicted on the basis of the differential expression relationships alone
Fig. 2
Fig. 2
Workflow for the Differential Gene Correlation Analysis (DGCA) R package. Users input a gene expression matrix, a design matrix to specify the conditions, and a comparison vector to specify which conditions will be compared. DGCA then calculates the gene pair correlations within each condition, processes these correlation values, and compares them to build up a difference in correlation matrix. If permutation testing is chosen, DGCA will perform the same procedure on permuted gene expression matrices. These permutation samples are used to estimate an empirical false discovery rate. After investigators choose the significance threshold for differential correlation between conditions (if any) to choose downstream gene pairs, they can use DGCA’s capacities for visualization, gene ontology (GO) enrichment, and/or network construction
Fig. 3
Fig. 3
Definition of differential correlation classes. This diagram demonstrates the definition of differential correlation classes used throughout DGCA in the case of two conditions. The default p-value significance level threshold (α) parameter setting for the hypothesis test of non-zero correlation in one individual condition in DGCA and throughout this manuscript is 0.05, although users of DGCA can adjust this parameter if they choose to. In each condition, gene pairs are defined as having a non-significant correlation (p-value < α), a significant and positive correlation (p-value < α, correlation (ρ) > 0), or a significant and negative correlation (p-value < α, correlation (ρ) < 0). The Cartesian product of the 3 possible correlation classes in one condition with the 3 in the other condition yields a total of 9 possible differential correlation classes. Note that, theoretically, a gene pair with a significant positive correlation in one condition and a non-significant correlation in another condition may not be significantly differentially correlated between these conditions since the correlation class identification is independent of the differential correlation hypothesis test
Fig. 4
Fig. 4
Comparing DGCA to alternatives in the differential correlation simulation study. a-d: Representative receiver operating characteristic (ROC) curves show the ability of DGCA, EBcoexpress, and Discordant with and without the use of the Fisher z-transformation (FT) to accurately detect truly differential correlated gene pairs at different simulated sample sizes in each condition, including n = 10 a, n = 30 b, n = 50 c, and n = 100 d. Black lines show the ROC curves of DGCA, while blue lines show the ROC curves of EBcoexpress, red lines show Discordant without the Fisher transformation, and orange lines show Discordant with the Fisher transformation. e: Comparison of area under curve (AUC) statistics for 5 runs of the simulation study using each of the methods at different numbers of samples, where errors bars represent the standard error of the mean. Asterisks indicate a Bonferroni-adjusted significant difference in the AUCs (p < 0.0083) between DGCA and each of the other methods, tested using a two-sided t-test
Fig. 5
Fig. 5
Comparing DGCA to alternatives segregated by the strength of correlation difference. a-c: Representative receiver operating characteristic (ROC) curves show the ability of DGCA a, EBcoexpress b, and Discordant (with the use of the Fisher z-transformation) c to accurately detect truly differential correlated gene pairs in each of the six differential correlation classes that specify a difference in correlations between conditions, as well as when comparing all gene pairs with a gain in correlation (GOC) or a loss of correlation (LOC) in one condition compared to the other, at a sample size of n = 30. d-e: Comparison of area under curve (AUC) statistics for 5 runs of the simulation study using each of the three methods at different numbers of samples, where errors bars represent the standard error of the mean, segregated to only those gene pairs with a strong difference in correlation (absolute value of difference in ρ = 1; d) or segregated to only those gene pairs with a medium strength difference in correlation (absolute value of difference in ρ = 0.5; e). Asterisks indicate a Bonferroni-adjusted significant difference in the AUCs (p < 0.0083) between DGCA and each of the other methods, tested using a two-sided t-test
Fig. 6
Fig. 6
Expression correlations of five genes with TP53 in samples with and without p53 mutations. Expression values for genes from breast cancer samples without p53 mutation, or p53 wildtype (WT; red, n = 590), compared to samples with non-silent p53 coding mutations (blue, n = 254). For each gene pair, the expression of TP53 is on the x-axis, while the expression of GNB2L1 a, RPS12 b, CDKN1A c, TSC22D1 d, and DGKA e are on the y-axis. For visualization purposes, a linear model was fit to the data in each comparison, with the grey lines representing 95% confidence intervals. This visualization was made using a DGCA wrapper function to the ggplot2 R package
Fig. 7
Fig. 7
Comparing differential expression and differential correlation with TP53 in samples with and without p53 mutations. For each gene, we plot both DGCA’s calculated differential correlation z-score between that gene and TP53 in p53 non-mutated breast cancer samples and p53-mutated samples (x-axis), as well as limma’s differential expression t statistic for that gene’s differential expression between the same p53 wildtype samples and p53-mutated samples (y-axis). When differential correlation z-scores are calculated on positive correlation values only a, the Spearman correlation between these two measures is not significant (ρ = 0.08, p-value = 0.15), and when differential correlation z-scores are calculated across all correlation values b, the Spearman correlation between these two measures is also not significant (ρ = 0.06, p-value = 0.30). The blue line represents a linear model of the best fit, with the grey lines representing 95% confidence intervals, computed using ggplot2
Fig. 8
Fig. 8
Differential correlations between the genes in the p53 pathway in p53 non-mutated and p53-mutated breast cancer samples. The top panel shows a histogram of the correlation values pooled between both conditions that also demonstrates the color schema used to indicate the expression correlation strength for each gene pairs. The bottom panel shows a heatmap of correlation values in the two conditions. The bottom left panel shows the correlation of gene pairs in the non-p53-mutated breast cancer RNA-seq samples, while the upper right panel shows the correlation of gene pairs in the samples with (non-silent) p53 mutations. The diagonal is marked black as each gene’s self-correlations from both conditions are omitted. Genes are ordered by the median difference of z-score in correlation between that gene and all of the other genes present in both the rows and columns. This plot was made using the DGCA wrapper function to the gplots R package
Fig. 9
Fig. 9
Summary of differential correlation results of PTEN in samples with and without PTEN mutations. a-b: Expression values for genes from breast cancer samples without PTEN mutation, or PTEN wildtype (WT; red, n = 816), compared to samples with non-silent PTEN mutations (blue, n = 27). For each gene pair, the expression of PTEN is on the x-axis, while the expression of FASLG a and IPCEF1 b are on the y-axis. For visualization purposes, a linear model was fit to the data in each comparison, with the grey lines representing 95% confidence intervals. c: For each gene, we plot both DGCA’s calculated differential correlation z-score between that gene and PTEN in PTEN non-mutated breast cancer samples and PTEN-mutated samples (x-axis), as well as limma’s differential expression t statistic for that gene’s differential expression between the same PTEN non-mutated samples and PTEN-mutated samples (y-axis). The correlation between the two measures is not significant (ρ = 0.13, p = 0.29). The blue line represents a linear model of the best fit, with the grey lines representing 95% confidence intervals
Fig. 10
Fig. 10
Gene ontology enrichment of genes associated with gain or loss of correlation in ER+ samples compared to TN. Genes identified as members of gene pairs with a significant gain in correlation (q < 0.01) in the estrogen receptor-positive (ER+) samples compared to the triple negative (TN) breast cancer samples were used as inputs to gene ontology (GO) enrichment analysis in GOstats. GO terms with less than 50 or greater than 600 gene symbol members were filtered out included for the purposes of interpretability. Odds ratios for the GO enrichment of terms were compared between these two groups, and p-values were adjusted using the Benjamini-Hochberg method for each GO term category to yield false discovery rates (FDR). Up to 5 GO terms enriched in one of the groups compared to the other at FDR < 0.3 are shown. No terms from the “Molecular Function” GO term category portion passed this FDR threshold and are therefore not shown. This plot was made using the DGCA wrapper function to the plotrix R package
Fig. 11
Fig. 11
Planar filtered network of differentially correlated genes. Gene pairs with a significant change in correlation (q < 0.05) in estrogen receptor-positive (ER+) compared to triple negative (TN) breast cancer samples were to construct a planar filtered network shown here. The modules identified at p < 0.05 in this network with between 100 and 800 members are identified with distinct colors. This plot was made using Cytoscape (version 3.2.1)
Fig. 12
Fig. 12
Enrichment of GO terms and differential correlation classes in the multiscale modules. This plot displays the enrichments of each of the multiscale modules (whose sizes are between 100 and 800 members) in the ER+ vs TN differential correlation network. The left panel shows the enrichment (Benjamini-Hochberg adjusted -log10 p-value) of edges in each of the differential correlation (DC) classes in each of the modules. The right panel shows the most significantly enriched gene ontology (GO) term (-Benjamini-Hochberg adjusted log10 p-value) for the genes in each of the corresponding modules, along with the GO enrichment of that same GO term in all the other modules. This plot was made using a DGCA wrapper function to R package ggplot2
Fig. 13
Fig. 13
The differential correlation module most enriched in gene pairs that gain in correlation in ER+ breast cancer. This module of genes was identified using MEGENA and chosen for downstream analysis because it was most enriched for the gene pairs with positive correlations in ER+ breast cancer but no significant correlation in triple negative (TN) breast cancer. Node size and gene symbol text size are proportional to the number of connections for each gene. Edges are colored according to the differential correlation class (see Legend), while edge weight is proportional to the absolute value of the z-score for the difference of correlation between ER+ and TN breast cancer samples. This plot was made using Cytoscape (version 3.2.1)
Fig. 14
Fig. 14
The differential correlation module most enriched for the gene pairs that gain correlation in TN breast cancer. This module of genes was identified using MEGENA and chosen for downstream analysis because it was most enriched for the gene pairs that were positively correlated in TN breast cancer but were either negatively or not significantly correlated in ER+ breast cancer. Node size and gene symbol text size are proportional to the number of connections for each gene. Edges are colored according to the differential correlation class (see Legend), while edge weight is proportional to the absolute value of the z-score for the difference of correlation for that gene pair between ER+ and TN breast cancer samples. This plot was made using Cytoscape (version 3.2.1)
Fig. 15
Fig. 15
Performance comparison of DGCA/MEGENA, DICER and DiffCoEx in detecting differential correlation modules. In the simulation study, there were two modules and 30 genes were assigned to each module, within which a fraction (κ) of the gene pairs had differential correlation, where κ varied from 0.5 to 1 across simulation runs. We also simulated different sample sizes in each condition (n = 100, 200, 300, and 400). The highest sensitivity (a; dots = means, lines = standard errors of the mean) and Jaccard index b between the detected and true modules were calculated and averaged for each method (black for DGCA/MEGENA, blue for DICER and red for DiffCoEx) across all the independent runs in the simulation study. In the case that no module was detected by a method, all the genes were considered to belong to one single module for calculating Jaccard index, indicated by the dotted horizontal line. Each Jaccard index mean was calculated from ten simulation runs and two true modules per run
Fig. 16
Fig. 16
Comparison of modules detected with DGCA/MEGENA to those detected with DiffCoEx and DICER. a: The number of modules significantly enriched (Fisher’s Exact Test; FDR < 0.3) that were detected by each of DGCA/MEGENA (red), DiffCoEx (green), and DICER (blue) in five different gene sets selected for their relevance to breast cancer. b: The proportion of modules detected by each of the three methods that are not significantly enriched in (i.e., do not overlap with) any of the modules detected by one (off-diagonals) or both (diagonals) of the other methods, by Fisher’s Exact Test, Benjamini-Hochberg adjusted p-value < 0.05. Columns labels denote the reference method compared; e.g., 62% of DGCA/MEGENA-detected modules do not significantly overlap with modules detected by either method, and 78% of DGCA/MEGENA-detected modules do not significantly overlap with any DICER-detected modules

References

    1. Zhu J, Zhang B, Schadt EE. A systems biology approach to drug discovery. Adv Genet. 2008;60:603–635. - PubMed
    1. Auffray C, Chen Z, Hood L. Systems medicine: the future of medical genomics and healthcare. Genome Med. 2009;1:2. doi: 10.1186/gm2. - DOI - PMC - PubMed
    1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. - DOI - PMC - PubMed
    1. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4:210. doi: 10.1186/gb-2003-4-4-210. - DOI - PMC - PubMed
    1. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:Article17. - PubMed

Publication types

MeSH terms

Substances