. 2019 Dec 1;35(23):5011-5017.

doi: 10.1093/bioinformatics/btz357.

Exact hypothesis testing for shrinkage-based Gaussian graphical models

Victor Bernal^{1

2}, Rainer Bischoff², Victor Guryev³, Marco Grzegorczyk¹, Peter Horvatovich²

Affiliations

¹ Bernoulli Institute, University of Groningen, Groningen AG, The Netherlands.
² Department of Analytical Biochemistry, Groningen Research Institute of Pharmacy.
³ European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen AV, The Netherlands.

PMID: 31077287
PMCID: PMC6901079
DOI: 10.1093/bioinformatics/btz357

Exact hypothesis testing for shrinkage-based Gaussian graphical models

Victor Bernal et al. Bioinformatics. 2019.

. 2019 Dec 1;35(23):5011-5017.

doi: 10.1093/bioinformatics/btz357.

Authors

Victor Bernal^{1

2}, Rainer Bischoff², Victor Guryev³, Marco Grzegorczyk¹, Peter Horvatovich²

Affiliations

¹ Bernoulli Institute, University of Groningen, Groningen AG, The Netherlands.
² Department of Analytical Biochemistry, Groningen Research Institute of Pharmacy.
³ European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, Groningen AV, The Netherlands.

PMID: 31077287
PMCID: PMC6901079
DOI: 10.1093/bioinformatics/btz357

Abstract

Motivation: One of the main goals in systems biology is to learn molecular regulatory networks from quantitative profile data. In particular, Gaussian graphical models (GGMs) are widely used network models in bioinformatics where variables (e.g. transcripts, metabolites or proteins) are represented by nodes, and pairs of nodes are connected with an edge according to their partial correlation. Reconstructing a GGM from data is a challenging task when the sample size is smaller than the number of variables. The main problem consists in finding the inverse of the covariance estimator which is ill-conditioned in this case. Shrinkage-based covariance estimators are a popular approach, producing an invertible 'shrunk' covariance. However, a proper significance test for the 'shrunk' partial correlation (i.e. the GGM edges) is an open challenge as a probability density including the shrinkage is unknown. In this article, we present (i) a geometric reformulation of the shrinkage-based GGM, and (ii) a probability density that naturally includes the shrinkage parameter.

Results: Our results show that the inference using this new 'shrunk' probability density is as accurate as Monte Carlo estimation (an unbiased non-parametric method) for any shrinkage value, while being computationally more efficient. We show on synthetic data how the novel test for significance allows an accurate control of the Type I error and outperforms the network reconstruction obtained by the widely used R package GeneNet. This is further highlighted in two gene expression datasets from stress response in Eschericha coli, and the effect of influenza infection in Mus musculus.

Availability and implementation: https://github.com/V-Bernal/GGM-Shrinkage.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
Geometry of the partial correlation. The vectors $\vec{x}$ , $\vec{y}$ and $\vec{z}$ represent the random variables *X, Y* and Z in subject space. In Panel (a), the correlation $r$ between X and Y is the cosine of $α$ . In Panel (b), the partial correlation between X and Y can be interpreted as the cosine of the angle $β$ . That is the cosine between the projection of $\vec{x}$ and $\vec{y}$ onto a plane orthogonal to $\vec{z}$ . The shrinkage effect consists in that the vectors $\vec{x}$ , $\vec{y}$ and $\vec{z}$ are transformed to $\vec{x^{λ}}, \vec{y^{λ}}$ and $\vec{z^{λ}}$ such that their lengths remain 1, and only the angles between each other change. In other words, the transformed vectors become less correlated. In Panel (c), the geometrical effect of the shrinkage consists in changing the projection plane $v^{⊥ z}$ to $v^{⊥ z^{λ}}$ . In Panel (d), the ‘shrunk’ partial correlation $ρ^{λ}$ between X and Y is the cosine of the angle $β^{λ}$ . That is the cosine between the projections of $\vec{x^{λ}}$ and $\vec{y^{λ}}$ onto $v^{⊥ z^{λ}}$

**Fig. 2.**
Probability densities and P-values under $H_{0}$ . This figure shows a comparison of the standard $f_{0} (ρ^{λ})$ and ‘shrunk’ ${f_{0}}^{λ} (ρ^{λ})$ densities, and their P-values under $H_{0}$ . Here Standard MLE denotes $f_{0} (ρ^{λ})$ (i.e. Equation 4) with k obtained via MLE (as in Shrunk MLE). Panel (a) shows the average histogram of P-values obtained with (i) Standard MLE (dark grey), (ii) Shrunk MLE (light grey) and (iii) MC (white) with 15 iterations. Panel (b) replaces Standard MLE by ENF to estimate $k$ in $f_{0} (ρ^{λ})$ (currently used in *GeneNet*). The bin’s width is set to 0.05; therefore, the first bin represents the amount of significant coefficients at the 5% level. The bin’s height corresponds to the mean over 25 simulations, and the error bars (for ENF) to ±2 SE. It can be seen that the P-values from ENF (dark grey) are not uniformly distributed in [0, 1] under $H_{0}$ . Panels (c) and (d) show the difference $f_{0} (ρ^{λ}) - {f_{0}}^{λ} (ρ^{λ})$ when $k$ is found via MLE or via ENF, respectively (see Section 2.1). Data were simulated with $p = 100$ , $n = 20$ and $λ =$ 0.94. The critical value at $α = 0.05$ (in dashed grey) is estimated by MC (with 100 iterations). It can be seen that $f_{0} (ρ^{λ})$ has larger tails than ${f_{0}}^{λ} (ρ^{λ})$

**Fig. 3.**
False positives. This figure shows the number of FPs obtained with different sample size $n$ . The number of FPs are shown in Panels (a) and (b) under $H_{0}$ : no partial correlation (i.e. the percentages of true correlations $δ$ is zero). Inference is carried out from simulated data for $p$ = 100 and $n$ ranging from 10 to 150 in steps of size 10. The black horizontal line represents the *expected* number of FPs under $H_{0}$ , tested at $α$ = 0.05 and $0.01$ respectively (i.e. 247.5 for $α$ = 0.05 and 49.5 for $α$ =0.01). Panels (c) and (d) show the number of FPs for different proportions $p / n$ when the percentages of non-zero correlations is $δ$ = 0.01. Here $p$ = 50, 100, 150, 200 and $n$ = 20. Three approaches are compared: ENF (dot with dashed line), Shrunk MLE (square with dotted line), and MC with 15 iterations (triangle with continuous line). Symbols (and bars) represent the average (±2 SE) over 25 repeated simulations. The upper horizontal axis shows the average shrinkage intensity $\overset{}{λ}$ rounded to two digits

**Fig. 4.**
FPR cross-comparison. This figure shows a heatmap for the difference in FPR under $H_{0}$ : no partial correlation with respect to the gold standard (MC). The number of variables $p$ range from 10 to 400, and the sample size $n$ from 3 to 100. Panel (a) shows the heatmap for the FPR for ENF minus the FPR for MC, averaged over 10 simulations (rounded to two decimals). Panel (b) show the respective results for Shrunk MLE. The test is carried out at $α$ = 0.05 with a shrinkage value fixed to $λ$ = 0.3. The color scale represents the FPR differences in the $p$ – $n$ grid, where the larger the FPR difference the darker is the corresponding grid cell. In general, Shrunk MLE outperforms ENF, and it is in close agreement with MC for $p > 40$ and $n > 10$

**Fig. 5.**
Positive predictive value. This figure shows the PPV ( $PPV = TP / P$ ) obtained with different sample sizes. The inference is carried out from simulated data for $p$ =100 with $n$ ranging from 10 to 150 in steps of size 10, tested at $α$ = 0.05. The Panels (a) and (b) show the PPV using Benjamini-Hochberg (BH)-adjusted P-values for multiple testing with $δ$ = 0.01 (or 49 correlations) and $δ$ = 0.03 (or 148 correlations). Three approaches are compared: ENF (dot with dashed line), Shrunk MLE (square with dotted line), and MC with 15 iterations (triangle with continuous line). Symbols (and bars) represent the average (±2 SEs) over 25 repeated simulations. The upper horizontal axis shows the average shrinkage intensity $\bar{λ}$ rounded to two digits

**Fig. 6.**
Amount of significant edges related to the induced expression of SOD in *E.coli*. Analysis of *E.coli* gene microarray expression data upon response to induced SOD expression. The dataset includes 102 genes (Schmidt-Heck *et al.*, 2004). The estimator produces an optimal shrinkage $λ$ = 0.18. Three methods are compared at $α$ = 0.05: ENF, Shrunk MLE, MC with 40 iterations. Panel (a) Venn diagram of significant partial correlations (i.e. edges in the GGM) recovered by each method with un-adjusted P-values. Panel (b) Venn diagram of significant partial correlations recovered by each method with BH-adjusted P-values

**Fig. 7.**
Amount of significant edges related to lung samples’ expression in *M.musculus*. Analysis of *M.musculus* RNA-seq expression data from lung samples (Steed *et al.*, 2017). The estimator produces an optimal shrinkage $λ \approx$ 0.11. Three methods are compared at $α = 0.10$ ; ENF, Shrunk MLE and MC (with 40 iterations). Panel (a) Venn diagram of significant partial correlations (i.e. edges in the GGM) recovered by each method with un-adjusted P-values. Panel (b) Venn diagram of significant partial correlations recovered by each method with BH-adjusted P-values

See this image and copyright information in PMC

References

1. Ashburner M. et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25.. - PMC - PubMed
1. Beerenwinkel N. et al. (2007) Genetic progression and the waiting time to cancer. PLoS Comput. Biol., 3, e225.. - PMC - PubMed
1. Benedetti E. et al. (2017) Network inference from glycoproteomics data reveals new reactions in the IgG glycosylation pathway. Nat. Commun., 8, 1483. - PMC - PubMed
1. Benjamini Y., Yosef H. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol., 57, 289–300.
1. Butte A.J., Kohane I.S. (2003) Relevance networks: a first step toward finding genetic regulatory networks within microarray data In: Parmigiani G.et al. (ed.) The Analysis of Gene Expression Data. Springer, New York, NY, pp. 428–446.

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Exact hypothesis testing for shrinkage-based Gaussian graphical models

Affiliations

Exact hypothesis testing for shrinkage-based Gaussian graphical models

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Molecular Biology Databases