Ann Appl Stat. 2013 Mar 1;7(1):10.1214/12-AOAS592.
doi: 10.1214/12-AOAS592.

VARIABLE SELECTION FOR SPARSE DIRICHLET-MULTINOMIAL REGRESSION WITH AN APPLICATION TO MICROBIOME DATA ANALYSIS

Jun Chen et al. Ann Appl Stat.

Abstract

With the development of next-generation sequencing technology, researchers can now study microbiome composition by direct sequencing, whose output is a set of bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of the observed counts. The DM regression model can be used to test the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach that estimates the regression parameters and selects variables by imposing a sparse group ℓ1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can select the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that sparse DM regression can identify microbiome-associated covariates better than models that ignore overdispersion or consider only the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results show that nutrient intake is strongly associated with the human gut microbiome.

Keywords: Coordinate descent; Counts data; Overdispersion; Regularized likelihood; Sparse group penalty.
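
The penalized likelihood described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a log-linear link between covariates and the Dirichlet parameters, and it assumes the penalty is parameterized as lam * (c * l1 + (1 - c) * l2) per covariate group, which may differ from the scaling used in the paper. All function and argument names are hypothetical, and the block-coordinate descent solver itself is not shown.

```python
import math


def dm_log_likelihood(counts, alphas):
    """Dirichlet-multinomial log-likelihood of one sample's taxa counts."""
    n = sum(counts)
    a = sum(alphas)
    ll = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for y, al in zip(counts, alphas):
        ll += math.lgamma(y + al) - math.lgamma(al) - math.lgamma(y + 1)
    return ll


def sparse_group_l1_penalty(beta, lam, c):
    """Sparse group l1 penalty; beta[k][j] is covariate k's effect on taxon j.

    Each covariate forms one group. The l2 term can zero out a whole group,
    the l1 term zeroes individual coefficients inside a group.
    The (c, 1 - c) weighting is an assumed parameterization.
    """
    total = 0.0
    for group in beta:
        l1 = sum(abs(b) for b in group)
        l2 = math.sqrt(sum(b * b for b in group))
        total += c * l1 + (1.0 - c) * l2
    return lam * total


def penalized_objective(X, Y, intercepts, beta, lam, c):
    """Negative DM log-likelihood plus penalty (assumed log-linear link)."""
    nll = 0.0
    for x, y in zip(X, Y):
        alphas = [
            math.exp(intercepts[j] + sum(beta[k][j] * x[k] for k in range(len(x))))
            for j in range(len(y))
        ]
        nll -= dm_log_likelihood(y, alphas)
    return nll + sparse_group_l1_penalty(beta, lam, c)
```

With c = 0 the penalty reduces to a pure group penalty that selects whole covariates; with c = 1 it reduces to an ordinary ℓ1 penalty on individual coefficients.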

Figures

Fig 1. Effect of the tuning parameter c on variable selection
The tuning parameter c is varied from 0 to 0.4. Under each value of c, the best λ value, which maximizes the likelihood of the test data set, is selected to generate the sparse model. Group (left) and within-group (right) selection performance are then evaluated using measures of recall, precision and F1 based on 100 replications. Simulation setting: n = 100, p = 100, pr = 4, q = 40, qr = 4, m = 500, θ = 0.025, ρ = 0.4.
Fig 2. Effects of overdispersion (top panel) and model-misspecification (bottom panel) on the performance of three different models and methods. DM-SGL: sparse group ℓ1 penalized Dirichlet-multinomial model; DM-L: ℓ1 penalized Dirichlet-multinomial model; M-SGL: sparse group ℓ1 penalized multinomial model; M-L: ℓ1 penalized multinomial model; D-SGL: sparse group ℓ1 penalized Dirichlet model; D-L: ℓ1 penalized Dirichlet model. For each bar, mean±standard error is presented based on 100 replications.
Fig 3. Effects of the number of relevant taxa (top panel) and the number of covariates (bottom panel) on the performance of several models and methods. DM-SGL: sparse group ℓ1 penalized Dirichlet-multinomial model; DM-L: ℓ1 penalized Dirichlet-multinomial model; M-SGL: sparse group ℓ1 penalized multinomial model; M-L: ℓ1 penalized multinomial model; D-SGL: sparse group ℓ1 penalized Dirichlet model; D-L: ℓ1 penalized Dirichlet model. For each bar, mean±standard error is presented based on 100 replications.
Fig 4. Model fit using the variables selected by the sparse group ℓ1 penalized DM model. Top plot: square root of the fitted counts versus square root of the observed counts based on the DM model with the selected nutrients; bottom plots: observed counts and simulated counts produced by the fitted sparse DM model and multinomial model.
Fig 5. Association of nutrients with human gut microbial taxa identified by the sparse group ℓ1 regularized DM model. We use a bipartite graph to visualize the selected nutrients and their associated genera based on sparse group ℓ1 penalized DM regression. Circle: genus; hexagon: nutrient; solid line: positive correlation; dashed line: negative correlation. The thickness of each line represents the association strength.
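
The recall, precision and F1 measures named in the Fig 1 caption can be computed from the set of selected variables and the set of truly relevant ones. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is empty are assumptions, not details taken from the paper.

```python
def selection_metrics(selected, truth):
    """Return (recall, precision, F1) for a variable-selection result.

    selected: indices chosen by the sparse model
    truth:    indices that are truly relevant in the simulation
    """
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return recall, precision, f1
```

The same function applies at both levels: pass covariate indices for group selection, or (covariate, taxon) pairs for within-group selection.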

