PLoS One. 2015 Jul 15;10(7):e0131739.
doi: 10.1371/journal.pone.0131739. eCollection 2015.

Overfitting Bayesian Mixture Models with an Unknown Number of Components

Zoé van Havre et al. PLoS One. 2015.

Abstract

This paper proposes solutions to three issues pertaining to the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. An overfitting approach is used to estimate the number of components in a finite mixture model via the Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test-based estimation methods, whereby priors are chosen to encourage extra groups to have weights approaching zero. MCMC sampling is made possible by the implementation of prior parallel tempering, an extension of parallel tempering. Given a sufficiently large sample size, Zmix can accurately estimate the number of components, posterior parameter estimates, and allocation probabilities. The results reflect uncertainty in the final model and report the range of candidate models and their respective estimated probabilities from a single run. Label switching is resolved with a computationally lightweight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies are included to illustrate Zmix and Zswitch, as well as three case studies from the literature. All methods are available as part of the R package Zmix, which can currently be applied to univariate Gaussian mixture models.
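The overfitting idea in the abstract can be illustrated with a small sketch. This is not the Zmix implementation and does not use the R package's API. It is a plain Gibbs sampler written for illustration, and it omits prior parallel tempering and Zswitch entirely. The assumptions are: component variances are fixed at 1, component means have a vague normal prior, and the mixture weights have a symmetric Dirichlet prior with a small hyperparameter `alpha`. The names `sample_dirichlet` and `k0_posterior` are hypothetical. The sampler fits more components than needed and records how many are non-empty at each iteration, which gives a rough posterior over the number of occupied groups.

```python
import math
import random
from collections import Counter


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def k0_posterior(y, k_max=10, alpha=0.01, n_iter=600, burn_in=300, seed=1):
    """Gibbs sampler for an overfitted Gaussian mixture (unit variances assumed).

    Returns the estimated posterior distribution of the number of
    non-empty components as a dict {count: probability}.
    """
    rng = random.Random(seed)
    prior_mean = sum(y) / len(y)
    prior_var = 100.0  # assumed vague prior variance on the component means
    z = [rng.randrange(k_max) for _ in y]  # random initial allocations
    k0_counts = Counter()

    for it in range(n_iter):
        # Sufficient statistics per component.
        counts = [0] * k_max
        sums = [0.0] * k_max
        for yi, zi in zip(y, z):
            counts[zi] += 1
            sums[zi] += yi

        # Weights: a small alpha pushes empty components towards zero weight.
        weights = sample_dirichlet([alpha + c for c in counts], rng)

        # Means: conjugate normal update with unit observation variance.
        means = []
        for k in range(k_max):
            post_var = 1.0 / (1.0 / prior_var + counts[k])
            post_mean = post_var * (prior_mean / prior_var + sums[k])
            means.append(rng.gauss(post_mean, math.sqrt(post_var)))

        # Allocations: sample each label from its full conditional.
        for i, yi in enumerate(y):
            logp = [
                math.log(w) - 0.5 * (yi - m) ** 2 if w > 0 else -math.inf
                for w, m in zip(weights, means)
            ]
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            z[i] = rng.choices(range(k_max), weights=probs)[0]

        if it >= burn_in:
            k0_counts[len(set(z))] += 1

    total = sum(k0_counts.values())
    return {k: c / total for k, c in sorted(k0_counts.items())}


if __name__ == "__main__":
    data_rng = random.Random(0)
    # Synthetic data: two well-separated unit-variance groups.
    data = [data_rng.gauss(-3.0, 1.0) for _ in range(50)]
    data += [data_rng.gauss(3.0, 1.0) for _ in range(50)]
    print(k0_posterior(data))
```

With a small `alpha`, components that become empty receive almost no weight and tend to stay empty, so the count of non-empty groups acts as an estimate of the number of components. A single chain like this one can mix poorly, which is the limitation the abstract addresses with prior parallel tempering.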

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Description of the four simulations considered in this paper.
Density plots of the mixture distributions are indicated by a dashed line, and histograms of a single realisation of each simulation (with n = 200) are included.
Fig 2. Number of alive (non-empty) groups K¯0j for each chain j for Sim 2.
Results are shown for Sim 2, n = 100 (left) and n = 200 (right). Boxplots of the number of non-empty groups K¯0j for each chain j are included; each chain represents posterior samples from the Zmix sampler with the hyperparameter α_j on the mixture weights, whose value is shown in red for each j.
Fig 3. Sim 2, n = 200, 𝓚k0 = 3.
Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.
Fig 4. Summary of the results for Sim 2 and n = 100.
The first two rows of plots refer to 𝓚k0 = 2, and the lower set refers to 𝓚k0 = 3. Results of Zmix and Zswitch are presented including, from upper left to lower right of each set: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior. A panel of plots is included for each candidate model found by Zmix.
Fig 5. Overfitting the Acidity dataset.
Results of Zmix and Zswitch including, from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.
Fig 6. Overfitting the Enzyme dataset.
Results of Zmix and Zswitch including, from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a posterior predictive density plot of 10,000 replicates with the density of the data represented as a dashed line.
Fig 7. Overfitting the Galaxy dataset.
Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data overlaid on the densities of 10,000 predicted datasets of the same size from the posterior.
