Electron J Stat. 2008;2:168-212.
doi: 10.1214/08-EJS194.

Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables

Benhuai Xie et al. Electron J Stat. 2008.

Abstract

Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.
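The shrinkage-and-thresholding effect mentioned for the M-steps can be illustrated with a minimal sketch. It assumes an L1 penalty on a cluster mean with the cluster-specific variance held fixed, and data centered so that a zero mean marks a variable as uninformative for that cluster. The function names and arguments (`soft_threshold`, `penalized_mean_update`, `lam`) are hypothetical and not taken from the article, and the article's own update formulas, in particular those for the variances and for grouped variables, may differ.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t are set exactly to zero."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def penalized_mean_update(weighted_mean, variance, weight_sum, lam):
    """Illustrative M-step update for one cluster mean under an L1 penalty.

    weighted_mean: unpenalized mean estimate, weighted by posterior probabilities
    variance:      current variance estimate for this cluster and variable
    weight_sum:    sum of posterior probabilities for this cluster
    lam:           penalty parameter (assumed notation, not from the article)
    """
    # Setting the derivative of the penalized expected log-likelihood to zero
    # gives a soft-thresholding rule with this threshold.
    threshold = lam * variance / weight_sum
    return soft_threshold(weighted_mean, threshold)


# A small mean is thresholded to zero; a large one is only shrunk.
small = penalized_mean_update(0.2, 1.0, 10.0, 5.0)
large = penalized_mean_update(2.0, 1.0, 10.0, 5.0)
```

Under these assumptions, a variable whose penalized means are zero in every cluster no longer helps separate the clusters, which is how variable selection and estimation can happen in the same step.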

Figures

Fig 1
Scatter plots of the estimated means and variances by the new penalized method. Panels a)–c) are scatter plots of the estimated variances in cluster 1 versus those in cluster 2, 3 and 4, respectively; panels d)–g) are the scatter plots of the estimated means versus estimated variances for the four clusters respectively.
Fig 2
Observed expression levels of two pairs of genes and the corresponding clusters found by the two penalized methods.
Fig 3
Penalized mean and variance estimates for cluster one containing the 11 ALL B-cell samples by the new penalized method.
Fig 4
Agglomerative hierarchical clustering results for the 38 leukemia samples: the first 8 samples were T-cell ALL; samples 9–27 were B-cell ALL; the remaining ones were AML.
Fig 5
Comparison of the two regularization schemes on the variance parameters for one dataset of set-up 3. σ̂_is is the MPLE for cluster i by scheme s.
Fig 6
Comparison of the two regularization schemes on the variance parameters for Golub’s data with the top 2000 genes. The x-axis and y-axis give the MPLEs by scheme 1 and scheme 2, respectively.
Fig 7
Comparison of the penalized variance estimates by regularization scheme 1 and the sample variances for Golub’s data with the top 2000 genes.
Fig 8
Comparison of the penalized variance estimates by regularization scheme 2 and the sample variances for Golub’s data with the top 2000 genes.
