Overfitting Bayesian Mixture Models with an Unknown Number of Components

PLoS One. 2015 Jul 15;10(7):e0131739.

doi: 10.1371/journal.pone.0131739. eCollection 2015.

Zoé van Havre¹, Nicole White², Judith Rousseau³, Kerrie Mengersen³

Affiliations

¹ School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia; CEREMADE, Université Paris Dauphine, Paris, France.
² School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia.
³ CEREMADE, Université Paris Dauphine, Paris, France.

PMID: 26177375
PMCID: PMC4503697
DOI: 10.1371/journal.pone.0131739

Zoé van Havre et al. PLoS One. 2015.

Abstract

This paper proposes solutions to three issues pertaining to the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. An overfitting approach is used to estimate the number of components in a finite mixture model via the Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test-based estimation methods, whereby priors are chosen to encourage extra groups to have weights approaching zero. MCMC sampling is made possible by the implementation of prior parallel tempering, an extension of parallel tempering. Zmix can accurately estimate the number of components, posterior parameter estimates and allocation probabilities given a sufficiently large sample size. The results reflect uncertainty in the final model and report, from a single run, the range of candidate models and their respective estimated probabilities. Label switching is resolved with a computationally lightweight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies are included to illustrate Zmix and Zswitch, as well as three case studies from the literature. All methods are available as part of the R package Zmix, which can currently be applied to univariate Gaussian mixture models.
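The overfitting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' Zmix implementation and it omits prior parallel tempering and Zswitch: it is a plain single-chain Gibbs sampler for a univariate Gaussian mixture, fitted with more components than needed and a small symmetric Dirichlet hyperparameter on the weights, which then tallies how many groups are non-empty at each iteration. The function names, the hyperparameters `a` and `b`, the data-dependent prior on the means, and the default values of `K`, `alpha`, `iters` and `burn` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def log_normal_pdf(y, mu, sigma2):
    """Log density of N(mu, sigma2) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)


def sample_categorical(log_p, rng):
    """Draw an index from unnormalised log-probabilities."""
    top = max(log_p)
    p = [math.exp(v - top) for v in log_p]
    u = rng.random() * sum(p)
    acc = 0.0
    for k, pk in enumerate(p):
        acc += pk
        if u <= acc:
            return k
    return len(p) - 1


def overfitted_gibbs(y, K=5, alpha=0.01, iters=400, burn=200, seed=1):
    """Estimate the posterior distribution of the number of non-empty groups.

    Assumed (hypothetical) priors: weights ~ Dirichlet(alpha, ..., alpha),
    means ~ N(m0, s0_sq), variances ~ Inverse-Gamma(a, b).
    The data must not be constant, so that s0_sq > 0.
    """
    rng = random.Random(seed)
    n = len(y)
    m0 = sum(y) / n
    s0_sq = sum((v - m0) ** 2 for v in y) / n
    a, b = 2.0, s0_sq / 4.0  # illustrative hyperparameters, not from the paper
    mu = [rng.choice(y) for _ in range(K)]
    sigma2 = [s0_sq] * K
    w = [1.0 / K] * K
    occupied = Counter()
    for it in range(iters):
        # 1. Allocate each observation to a component.
        log_w = [math.log(max(wk, 1e-300)) for wk in w]
        z = [
            sample_categorical(
                [log_w[k] + log_normal_pdf(v, mu[k], sigma2[k]) for k in range(K)],
                rng,
            )
            for v in y
        ]
        counts = [0] * K
        sums = [0.0] * K
        for v, zi in zip(y, z):
            counts[zi] += 1
            sums[zi] += v
        # 2. Weights: Dirichlet(alpha + counts) via normalised gamma draws.
        g = [rng.gammavariate(alpha + counts[k], 1.0) for k in range(K)]
        total = sum(g)
        w = [gk / total for gk in g]
        # 3. Means and variances from their conjugate full conditionals.
        for k in range(K):
            prec = 1.0 / s0_sq + counts[k] / sigma2[k]
            mean = (m0 / s0_sq + sums[k] / sigma2[k]) / prec
            mu[k] = rng.gauss(mean, math.sqrt(1.0 / prec))
            ss = sum((v - mu[k]) ** 2 for v, zi in zip(y, z) if zi == k)
            shape = a + counts[k] / 2.0
            sigma2[k] = 1.0 / rng.gammavariate(shape, 1.0 / (b + 0.5 * ss))
        # 4. Record how many groups hold at least one observation.
        if it >= burn:
            occupied[sum(1 for c in counts if c > 0)] += 1
    kept = iters - burn
    return {k: c / kept for k, c in sorted(occupied.items())}
```

Calling `overfitted_gibbs(y)` on a list of observations returns a dictionary mapping each observed number of non-empty groups to its estimated posterior probability. A single chain with a very small `alpha` can mix poorly, which is the limitation the abstract says prior parallel tempering is meant to address.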

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 1. Description of the four simulations considered in this paper.**
Density plots of the mixture distributions are indicated by a dashed line, and histograms of a single realisation of each simulation (with n = 200) are included.

**Fig 2. Number of alive (non-empty) groups ${\underline{K}}_{0}^{j}$ for each chain j for Sim 2.**
Results are shown for n = 100 (left) and n = 200 (right). Boxplots of the number of non-empty groups ${\underline{K}}_{0}^{j}$ for each chain j are included; each chain represents posterior samples from the Zmix sampler with the hyperparameter $\alpha^{j}$ on the mixture weights, the value of which is shown in red for each j.

**Fig 3. Sim 2, n = 200, 𝓚_k₀ = 3.**
Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.

**Fig 4. Summary of the results for Sim 2 and n = 100.**
The first two rows of plots refer to 𝓚_k₀ = 2, and the lower set refers to 𝓚_k₀ = 3. Results of Zmix and Zswitch are presented including, from upper left to lower right of each set: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior. A panel of plots is included for each candidate model found by Zmix.

**Fig 5. Overfitting the *Acidity* dataset.**
Results of Zmix and Zswitch including, from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data over the densities of 10,000 predicted datasets of the same size from the posterior.

**Fig 6. Overfitting the *Enzyme* dataset.**
Results of Zmix and Zswitch including, from upper left to lower right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a posterior predictive density plot of 10,000 replicates with the density of the data represented as a dashed line.

**Fig 7. Overfitting the *Galaxy* dataset.**
Results of Zmix and Zswitch including, from top left to right: the posterior parameter densities of all parameters from estimated groups, the posterior probability of allocations for each observation for each component, and a density plot of the data overlaid on the densities of 10,000 predicted datasets of the same size from the posterior.

Cited by

Disentangling Qualitatively Different Faking Strategies in High-Stakes Personality Assessments: A Mixture Extension of the Multidimensional Nominal Response Model.
Seitz T, Alagöz ÖEC, Meiser T. Educ Psychol Meas. 2025 Jul 29:00131644251341843. doi: 10.1177/00131644251341843. Online ahead of print. PMID: 40756699. Free PMC article.
Identifying dietary consumption patterns from survey data: a Bayesian nonparametric latent class model.
Stephenson BJK, Wu SM, Dominici F. J R Stat Soc Ser A Stat Soc. 2023 Dec 12;187(2):496-512. doi: 10.1093/jrsssa/qnad135. eCollection 2024 Apr. PMID: 38617597. Free PMC article.
PyClone-VI: scalable inference of clonal population structures using whole genome data.
Gillis S, Roth A. BMC Bioinformatics. 2020 Dec 10;21(1):571. doi: 10.1186/s12859-020-03919-2. PMID: 33302872. Free PMC article.
Robust Clustering with Subpopulation-specific Deviations.
Stephenson BJK, Herring AH, Olshan A. J Am Stat Assoc. 2020;115(530):521-537. doi: 10.1080/01621459.2019.1611583. Epub 2019 Jun 19. PMID: 32952235. Free PMC article.
Empirically Derived Dietary Patterns Using Robust Profile Clustering in the Hispanic Community Health Study/Study of Latinos.
Stephenson BJK, Sotres-Alvarez D, Siega-Riz AM, Mossavar-Rahmani Y, Daviglus ML, Van Horn L, Herring AH, Cai J. J Nutr. 2020 Oct 12;150(10):2825-2834. doi: 10.1093/jn/nxaa208. PMID: 32710754. Free PMC article.
