Shrinkage improves estimation of microbial associations under different normalization methods

NAR Genom Bioinform. 2020 Dec 17;2(4):lqaa100.

doi: 10.1093/nargab/lqaa100. eCollection 2020 Dec.

Michelle Badri¹, Zachary D Kurtz², Richard Bonneau¹, Christian L Müller³

Affiliations

¹ Department of Biology, New York University, New York, NY 10012, USA.
² Lodo Therapeutics, New York, NY 10016, USA.
³ Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY 10010, USA.

PMID: 33575644
PMCID: PMC7745771
DOI: 10.1093/nargab/lqaa100

Michelle Badri et al. NAR Genom Bioinform. 2020.

Abstract

Estimation of statistical associations in microbial genomic survey count data is fundamental to microbiome research. Experimental limitations, including count compositionality, low sample sizes and technical variability, obstruct standard application of association measures and require data normalization prior to statistical estimation. Here, we investigate the interplay between data normalization, microbial association estimation and available sample size by leveraging the large-scale American Gut Project (AGP) survey data. We analyze the statistical properties of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard statistical regularization technique, can universally improve the quality of taxon-taxon association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as downstream data analysis examples, we show that variance-stabilizing and log-ratio approaches enable the most taxonomically and structurally coherent estimates. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.
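As a rough illustration of the ingredients named in the abstract, the following Python sketch applies a centered log-ratio (clr) transform, computes Pearson correlation and the proportionality measure rho on the transformed data, and shrinks the correlation matrix linearly toward the identity. It is a minimal sketch under stated assumptions, not the authors' workflow: the pseudocount of 1, the toy count table and the hand-picked shrinkage intensity `lam` are illustrative choices, whereas shrinkage estimators typically derive the intensity from the data.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample (a list of taxon counts).
    The pseudocount is an assumption that avoids log(0) for unobserved taxa."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def variance(x):
    """Unbiased sample variance."""
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / (len(x) - 1)

def covariance(x, y):
    """Unbiased sample covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    return covariance(x, y) / math.sqrt(variance(x) * variance(y))

def rho(x, y):
    """Proportionality rho for two log-ratio transformed taxa:
    1 - var(x - y) / (var(x) + var(y))."""
    diff = [a - b for a, b in zip(x, y)]
    return 1.0 - variance(diff) / (variance(x) + variance(y))

def association_matrix(samples, measure):
    """samples: one row per sample; returns a taxa-by-taxa matrix."""
    taxa = list(zip(*samples))
    p = len(taxa)
    return [[measure(taxa[i], taxa[j]) for j in range(p)] for i in range(p)]

def shrink_to_identity(R, lam):
    """Linear shrinkage (1 - lam) * R + lam * I of a correlation matrix.
    lam is fixed by hand here, which is an assumption of this sketch."""
    p = len(R)
    return [[1.0 if i == j else (1.0 - lam) * R[i][j] for j in range(p)]
            for i in range(p)]

# Hypothetical count table: 5 samples (rows) x 4 taxa (columns).
counts = [
    [10, 200, 30, 4],
    [25, 150, 5, 9],
    [40, 90, 60, 1],
    [5, 300, 12, 7],
    [18, 120, 44, 3],
]
transformed = [clr(row) for row in counts]
R = association_matrix(transformed, pearson)
R_shrunk = shrink_to_identity(R, lam=0.2)
P = association_matrix(transformed, rho)
```

Shrinking toward the identity pulls every off-diagonal estimate toward zero by the same factor, which is what damps spurious associations when few samples are available.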

Figures

**Figure 1.**
Framework for examining the effects of normalization methods on linear association estimation with increasing sample size. Comparative summary statistics of the resulting association matrices include distribution-based analysis, distance-based matrix comparison, hierarchical clustering and association network analysis.

**Figure 2.**
Frobenius distance between estimates of association. (A) Average Frobenius distance between subsamples of the same sample size. Dashed lines represent the mean distance between normalized matrices after Pearson correlation. The solid lines represent the mean distance between normalized matrices where correlation/proportionality estimation with shrinkage was performed. The dot-dashed line represents rho, a proportionality metric. The long-dashed line represents rhoshrink, proportionality with shrinkage included. Vertical lines represent standard deviation from the mean for each corresponding method. (B) Multidimensional scaling (MDS) representation of Frobenius distance between correlation structures of varying sizes estimated from different normalization methods. The most opaque points represent the mean of five subsamples of the same size [color scheme as in (A)]. Points are labeled based on subsample size.
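The distance summarized in panel (A) can be sketched as follows, assuming each subsample yields a square association matrix stored as a list of rows. The three 2 x 2 matrices are made-up values for illustration only, and averaging over all pairs of subsamples is an assumption about how the per-sample-size mean is formed.

```python
import math
from itertools import combinations

def frobenius_distance(A, B):
    """Square root of the summed squared entrywise differences of A and B."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def mean_pairwise_distance(matrices):
    """Average Frobenius distance over all pairs of matrices."""
    pairs = list(combinations(matrices, 2))
    return sum(frobenius_distance(A, B) for A, B in pairs) / len(pairs)

# Hypothetical association estimates from three subsamples of equal size.
estimates = [
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[1.0, 0.3], [0.3, 1.0]],
]
d01 = frobenius_distance(estimates[0], estimates[1])
avg = mean_pairwise_distance(estimates)
```

A smaller average distance means that estimates from different subsamples of the same size agree more closely with one another.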

**Figure 3.**
Density of association values under different transformations and shrinkage. To represent clr and tss, data are normalized and correlation is calculated with shrinkage. Proportionality without shrinkage and proportionality with shrinkage are represented by rhoprop and rhoshrink, respectively. Each plot is a single random subsample of four representative methods at (A) 50 samples, (B) 50 samples with shuffled data and (C) 9000 samples. Mean, variance, skewness and kurtosis are shown for each distribution. Additional methods are provided in Supplementary Figure S3.
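The four summary statistics reported for each density could be computed along these lines. This is a sketch that assumes the statistics are taken over the off-diagonal association values, uses population (biased) moments and reports non-excess kurtosis; the caption does not specify these conventions, and the example matrix is invented.

```python
def upper_triangle(M):
    """Off-diagonal entries above the diagonal of a square matrix."""
    return [M[i][j] for i in range(len(M)) for j in range(i + 1, len(M))]

def moments(values):
    """Mean, variance, skewness and kurtosis from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # a normal distribution gives 3 here
    }

# Hypothetical 3 x 3 association matrix.
M = [
    [1.0, -0.5, 0.0],
    [-0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
stats = moments(upper_triangle(M))
```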

**Figure 4.**
OTU clusters from spectral clustering. (A–D) Each horizontal bar represents the family-level composition of OTUs in a cluster. Clusters are ordered by increasing percentage of the most abundant family, Ruminococcaceae. Colors represent the OTU families in each cluster. Numbers to the left of each bar give the number of OTUs in that cluster. Values next to each method name represent cluster purity. Additional methods are provided in Supplementary Figure S7.
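Cluster purity, as reported next to each method name, can be sketched under the standard definition: the fraction of OTUs that belong to the majority family of their cluster. That the figure uses exactly this definition is an assumption, and the labels below are illustrative rather than taken from the study.

```python
from collections import Counter

def cluster_purity(cluster_labels, family_labels):
    """Fraction of items that share the majority family of their cluster."""
    by_cluster = {}
    for cluster, family in zip(cluster_labels, family_labels):
        by_cluster.setdefault(cluster, Counter())[family] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_labels)

# Hypothetical assignment of six OTUs to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
families = [
    "Ruminococcaceae", "Ruminococcaceae", "Lachnospiraceae",
    "Lachnospiraceae", "Lachnospiraceae", "Bacteroidaceae",
]
purity = cluster_purity(clusters, families)
```

A purity of 1 would mean that every cluster contains OTUs from a single family; in this toy example, four of the six OTUs match their cluster's majority family.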

**Figure 5.**
Circular dendrograms showing hierarchical clustering patterns among OTUs. Each point surrounding the circular dendrogram represents one of the 531 OTUs in our dataset. The color represents family annotation. Each dendrogram (A–D) has been cut hierarchically into 10 trees (representing the 10 orders to which these taxonomic families map). The gray and black shading is used to highlight different clusters that are numbered. Hierarchical clustering of clr-transformed OTUs is better at delineating taxonomic relationships than clustering of those using tss; rhoprop and rhoshrink produce similar clustering patterns. Additional methods are provided in Supplementary Figure S8.

**Figure 6.**
Community structure of relevance networks. (A–D) The left network of each panel shows module membership. Each numbered node represents the module annotation of an OTU in the graph. The networks on the right represent the corresponding taxonomic annotation of the OTU at the family (color) and phylum (shape) levels. Values stated next to method name represent the number of modules in the network. Node layout is conserved for both networks in each panel. Additional methods are provided in Supplementary Figure S9.

**Figure 7.**
Community analysis of relevance network structure with increasing sample size. (A) Assortativity coefficient across sample size of genus annotation. (B) Maximum modularity score across sample size at 2000 edges. For all plots, lines represent mean and gray ribbons represent standard deviation from the mean.

**Figure 8.**
Shared interactions between relevance networks. (A) Consensus network of edges in common between four representative methods. Network contains 1086 edges between 346 OTUs. Node color represents family annotation and node shape represents phylum. (B) Venn diagram showing unique and shared interactions predicted from representative normalization methods.

Cited by

Tree-aggregated predictive modeling of microbiome data.
Bien J, Yan X, Simpson L, Müller CL. Sci Rep. 2021 Jul 15;11(1):14505. doi: 10.1038/s41598-021-93645-3. PMID: 34267244.
Metagenomic study of the gut microbiota associated with cow milk consumption in Chinese peri-/postmenopausal women.
Tian B, Yao JH, Lin X, Lv WQ, Jiang LD, Wang ZQ, Shen J, Xiao HM, Xu H, Xu LL, Cheng X, Shen H, Qiu C, Luo Z, Zhao LJ, Yan Q, Deng HW, Zhang LS. Front Microbiol. 2022 Aug 16;13:957885. doi: 10.3389/fmicb.2022.957885. eCollection 2022. PMID: 36051762.
Poisson hurdle model-based method for clustering microbiome features.
Qiao Z, Barnes E, Tringe S, Schachtman DP, Liu P. Bioinformatics. 2023 Jan 1;39(1):btac782. doi: 10.1093/bioinformatics/btac782. PMID: 36469352.
Is There a Universal Endurance Microbiota?
Olbricht H, Twadell K, Sandel B, Stephens C, Whittall JB. Microorganisms. 2022 Nov 9;10(11):2213. doi: 10.3390/microorganisms10112213. PMID: 36363806.
Bacterial low-abundant taxa are key determinants of a healthy airway metagenome in the early years of human life.
Pust MM, Tümmler B. Comput Struct Biotechnol J. 2021 Dec 15;20:175-186. doi: 10.1016/j.csbj.2021.12.008. eCollection 2022. PMID: 35024091.
