Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables

Benhuai Xie¹, Wei Pan, Xiaotong Shen

PMID: 19920875
PMCID: PMC2777718
DOI: 10.1214/08-EJS194

Benhuai Xie et al. Electron J Stat. 2008;2:168-212.

Affiliation

¹ Division of Biostatistics, School of Public Health, University of Minnesota, benhuaix@biostat.umn.edu.

Abstract

Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.
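The shrinkage-and-thresholding behavior mentioned for the M-steps can be illustrated with a soft-thresholding operator. The following is a minimal sketch, not the paper's actual update: it assumes an L1-type penalty on the cluster means, and the function names, the penalty weight `lam`, and the form of the threshold (`lam * variance / total_weight`) are illustrative assumptions rather than formulas taken from the abstract.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return 0.0 when |value| <= threshold."""
    magnitude = abs(value) - threshold
    if magnitude <= 0.0:
        return 0.0
    return magnitude if value > 0 else -magnitude


def penalized_mean_update(weighted_mean, variance, total_weight, lam):
    """Hypothetical penalized M-step for one cluster mean of one variable.

    Assumed form: the threshold grows with the cluster-specific variance and
    shrinks with the effective cluster size (sum of posterior weights).
    """
    return soft_threshold(weighted_mean, lam * variance / total_weight)


# Unpenalized weighted means of one variable in three clusters (made-up numbers).
raw_means = [0.05, -0.8, 1.2]
updated = [
    penalized_mean_update(m, variance=1.0, total_weight=10.0, lam=1.0)
    for m in raw_means
]
print(updated)  # the small mean is set exactly to zero; the others move toward zero
```

Under these assumptions, a variable whose penalized means are zero in every cluster no longer separates the clusters through its means, which is how thresholding can act as variable selection.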

Figures

**Fig 1**
Scatter plots of the estimated means and variances by the new penalized method. Panels a)–c) are scatter plots of the estimated variances in cluster 1 versus those in clusters 2, 3 and 4, respectively; panels d)–g) are scatter plots of the estimated means versus the estimated variances for the four clusters, respectively.

**Fig 2**
Observed expression levels of two pairs of genes and the corresponding clusters found by the two penalized methods.

**Fig 3**
Penalized mean and variance estimates for cluster one containing the 11 ALL B-cell samples by the new penalized method.

**Fig 4**
Agglomerative hierarchical clustering results for the 38 leukemia samples: the first 8 samples were T-cell ALL; samples 9–27 were B-cell ALL; the remaining ones were AML.

**Fig 5**
Comparison of the two regularization schemes on the variance parameters for one dataset of set-up 3. σ̂_is is the maximum penalized likelihood estimate (MPLE) for cluster i under scheme s.

**Fig 6**
Comparison of the two regularization schemes on the variance parameters for Golub’s data with the top 2000 genes. The x-axis and y-axis give the MPLEs by scheme 1 and scheme 2, respectively.

**Fig 7**
Comparison of the penalized variance estimates by regularization scheme 1 and the sample variances for Golub’s data with the top 2000 genes.

**Fig 8**
Comparison of the penalized variance estimates by regularization scheme 2 and the sample variances for Golub’s data with the top 2000 genes.

Cited by

Meta-analytic framework for sparse K-means to identify disease subtypes in multiple transcriptomic studies.
Huo Z, Ding Y, Liu S, Oesterreich S, Tseng G. J Am Stat Assoc. 2016;111(513):27-42. doi: 10.1080/01621459.2015.1086354. Epub 2016 May 5. PMID: 27330233.
Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty.
Pan W, Shen X, Liu B. J Mach Learn Res. 2013 Jul 1;14(7):1865. PMID: 24358018.
A framework for feature selection in clustering.
Witten DM, Tibshirani R. J Am Stat Assoc. 2010 Jun 1;105(490):713-726. doi: 10.1198/jasa.2010.tm09415. PMID: 20811510.
Estimation of multiple networks in Gaussian mixture models.
Gao C, Zhu Y, Shen X, Pan W. Electron J Stat. 2016;10:1133-1154. doi: 10.1214/16-EJS1135. Epub 2016 May 2. PMID: 28966702.
Discovering a sparse set of pairwise discriminating features in high-dimensional data.
Melton S, Ramanathan S. Bioinformatics. 2021 Apr 19;37(2):202-212. doi: 10.1093/bioinformatics/btaa690. PMID: 32730566.

