NAR Genom Bioinform. 2020 Dec 17;2(4):lqaa100.
doi: 10.1093/nargab/lqaa100. eCollection 2020 Dec.

Shrinkage improves estimation of microbial associations under different normalization methods

Michelle Badri et al. NAR Genom Bioinform.

Abstract

Estimation of statistical associations in microbial genomic survey count data is fundamental to microbiome research. Experimental limitations, including count compositionality, low sample sizes and technical variability, obstruct standard application of association measures and require data normalization prior to statistical estimation. Here, we investigate the interplay between data normalization, microbial association estimation and available sample size by leveraging the large-scale American Gut Project (AGP) survey data. We analyze the statistical properties of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard statistical regularization technique, can universally improve the quality of taxon-taxon association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as downstream data analysis examples, we show that variance-stabilizing and log-ratio approaches enable the most taxonomically and structurally coherent estimates. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.
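As a rough illustration of two ingredients named in the abstract, the sketch below applies a centered log-ratio (clr) transformation to toy counts and then shrinks the resulting Pearson correlation matrix toward the identity. It is a minimal sketch under stated assumptions, not the authors' workflow: the pseudocount, the hand-picked shrinkage intensity `lam`, and the names `clr`, `pearson`, and `shrunk_correlation` are all hypothetical. Published shrinkage estimators typically estimate the intensity from the data instead of fixing it.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumption of this sketch; it avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shrunk_correlation(samples, lam):
    """Taxon-taxon correlation matrix shrunk toward the identity.

    samples: list of rows, one per sample, each a list of transformed values.
    lam: shrinkage intensity in [0, 1], fixed by hand here (hypothetical).
    """
    p = len(samples[0])
    cols = [[row[j] for row in samples] for j in range(p)]
    return [
        [1.0 if i == j else (1.0 - lam) * pearson(cols[i], cols[j])
         for j in range(p)]
        for i in range(p)
    ]


# Toy counts: 4 samples (rows) by 3 taxa (columns); not real data.
toy_counts = [[10, 20, 30], [5, 40, 8], [12, 18, 33], [7, 35, 10]]
toy_clr = [clr(row) for row in toy_counts]
raw = shrunk_correlation(toy_clr, lam=0.0)     # ordinary Pearson correlation
shrunk = shrunk_correlation(toy_clr, lam=0.5)  # off-diagonals pulled toward 0
```

With few samples, individual correlation estimates are noisy; pulling every off-diagonal entry toward zero trades a small bias for lower variance, which is the general idea behind shrinkage.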

Figures

Figure 1.
Framework for examining the effects of normalization methods on linear association estimation with increasing sample size. Comparative summary statistics of the resulting association matrices include distribution-based analysis, distance-based matrix comparison, hierarchical clustering and association network analysis.
Figure 2.
Frobenius distance between estimates of association. (A) Average Frobenius distance between subsamples of the same sample size. Dashed lines represent the mean distance between normalized matrices after Pearson correlation. The solid lines represent the mean distance between normalized matrices where correlation/proportionality estimation with shrinkage was performed. The dot-dashed line represents rho, a proportionality metric. The long-dashed line represents rhoshrink, proportionality with shrinkage included. Vertical lines represent standard deviation from the mean for each corresponding method. (B) Multidimensional scaling (MDS) representation of Frobenius distance between correlation structures of varying sizes estimated from different normalization methods. The most opaque points represent the mean of five subsamples of the same size [color scheme as in (A)]. Points are labeled based on subsample size.
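The Frobenius distance used in this figure can be sketched in a few lines. The example assumes association matrices stored as nested lists and assumes that the "average distance between subsamples" is the mean over all pairs of subsample matrices. The helper names are hypothetical and the matrices are toy values.

```python
import math
from itertools import combinations


def frobenius_distance(a, b):
    """Square root of the sum of squared entrywise differences."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def mean_pairwise_distance(matrices):
    """Average Frobenius distance over all unordered pairs of matrices."""
    pairs = list(combinations(matrices, 2))
    return sum(frobenius_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 2x2 association matrices standing in for estimates from subsamples.
m1 = [[1.0, 0.0], [0.0, 1.0]]
m2 = [[1.0, 0.5], [0.5, 1.0]]
d = frobenius_distance(m1, m2)
avg = mean_pairwise_distance([m1, m2, m1])
```

A smaller average distance among same-size subsamples indicates that the estimator gives more consistent results from one subsample to the next.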
Figure 3.
Density of association values under different transformations and shrinkage. To represent clr and tss, data are normalized and correlation is calculated with shrinkage. Proportionality without shrinkage and proportionality with shrinkage are represented by rhoprop and rhoshrink, respectively. Each plot is a single random subsample of four representative methods at (A) 50 samples, (B) 50 samples with shuffled data and (C) 9000 samples. Mean, variance, skewness and kurtosis are shown for each distribution. Additional methods are provided in Supplementary Figure S3.
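For readers unfamiliar with proportionality, the sketch below computes a ρ-style statistic for one pair of taxa. It assumes the commonly used definition on log-ratio-transformed values, ρ(x, y) = 1 − var(x − y) / (var(x) + var(y)). The caption itself does not define the metric, so treat this as an illustrative assumption. The function names and input values are hypothetical.

```python
def sample_variance(x):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(x)
    m = sum(x) / n
    return sum((v - m) ** 2 for v in x) / (n - 1)


def rho(clr_x, clr_y):
    """Proportionality between two taxa given clr-transformed values.

    Returns 1 for perfectly proportional taxa and -1 for perfectly
    reciprocal ones, under the definition assumed in the lead-in.
    """
    diff = [a - b for a, b in zip(clr_x, clr_y)]
    return 1.0 - sample_variance(diff) / (
        sample_variance(clr_x) + sample_variance(clr_y)
    )


# Toy clr values for one taxon across four samples (not real data).
x = [0.1, 0.4, -0.2, 0.9]
same = rho(x, x)                    # identical profiles
opposite = rho(x, [-v for v in x])  # mirrored profiles
```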
Figure 4.
OTU clusters from spectral clustering. (A–D) Each horizontal bar represents the composition of OTUs in a cluster at the family level. Clusters are in order of increasing percentage of the most abundant family: Ruminococcaceae. In each cluster, the colors represent the OTU families in each cluster. Numbers to the left of each bar represent the number of OTUs in each cluster. Values next to each method name represent cluster purity. Additional methods are provided in Supplementary Figure S7.
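Cluster purity, reported next to each method name, can be sketched as follows. The sketch assumes the standard definition: each cluster is credited with its most frequent family label, and the credited counts are summed and divided by the total number of OTUs. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_labels, family_labels):
    """Fraction of OTUs that belong to the majority family of their cluster."""
    per_cluster = {}
    for cluster, family in zip(cluster_labels, family_labels):
        per_cluster.setdefault(cluster, Counter())[family] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_labels)


# Toy assignment of six OTUs to two clusters (not real data).
clusters = [0, 0, 0, 1, 1, 1]
families = [
    "Ruminococcaceae", "Ruminococcaceae", "Lachnospiraceae",
    "Lachnospiraceae", "Lachnospiraceae", "Bacteroidaceae",
]
purity = cluster_purity(clusters, families)
```

A purity of 1 would mean every cluster contains OTUs from a single family; lower values indicate taxonomically mixed clusters.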
Figure 5.
Circular dendrograms showing hierarchical clustering patterns among OTUs. Each point surrounding the circular dendrogram represents one of the 531 OTUs in our dataset. The color represents family annotation. Each dendrogram (A–D) has been cut hierarchically into 10 trees (representing the 10 orders to which these taxonomic families map). The gray and black shading is used to highlight different clusters that are numbered. Hierarchical clustering of clr-transformed OTUs is better at delineating taxonomic relationships than clustering of those using tss; rhoprop and rhoshrink produce similar clustering patterns. Additional methods are provided in Supplementary Figure S8.
Figure 6.
Community structure of relevance networks. (A–D) The left network of each panel shows module membership. Each numbered node represents the module annotation of an OTU in the graph. The networks on the right represent the corresponding taxonomic annotation of the OTU at the family (color) and phylum (shape) levels. Values stated next to method name represent the number of modules in the network. Node layout is conserved for both networks in each panel. Additional methods are provided in Supplementary Figure S9.
Figure 7.
Community analysis of relevance network structure with increasing sample size. (A) Assortativity coefficient across sample size of genus annotation. (B) Maximum modularity score across sample size at 2000 edges. For all plots, lines represent mean and gray ribbons represent standard deviation from the mean.
Figure 8.
Shared interactions between relevance networks. (A) Consensus network of edges in common between four representative methods. Network contains 1086 edges between 346 OTUs. Node color represents family annotation and node shape represents phylum. (B) Venn diagram showing unique and shared interactions predicted from representative normalization methods.
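The idea of a consensus network, meaning edges that every method agrees on, can be sketched with set intersection. The example assumes undirected edges given as OTU-identifier pairs. The method keys reuse labels from the captions, but the edge lists and function names are hypothetical toy values, not results from the paper.

```python
def edge_set(edges):
    """Normalize undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def consensus_edges(networks):
    """Edges present in every network; networks maps method name to edges."""
    sets = [edge_set(edges) for edges in networks.values()]
    return set.intersection(*sets)


# Toy edge lists for three methods (not real data).
toy_networks = {
    "clr": [("otu1", "otu2"), ("otu2", "otu3"), ("otu1", "otu4")],
    "tss": [("otu2", "otu1"), ("otu3", "otu4")],
    "rhoshrink": [("otu1", "otu2"), ("otu2", "otu3")],
}
shared = consensus_edges(toy_networks)
shared_otus = set().union(*shared)  # nodes touched by consensus edges
```

Using `frozenset` for each edge keeps the comparison independent of node order, which matters when different methods list the same interaction in opposite directions.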
