Electron J Stat. 2009 Jan 1;3:1473-1496.
doi: 10.1214/09-EJS487.

Penalized model-based clustering with unconstrained covariance matrices

Hui Zhou et al. Electron J Stat. 2009.

Abstract

Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Noise removal through variable selection is therefore necessary. One effective approach is regularization, which performs parameter estimation and variable selection simultaneously in model-based clustering. However, existing methods focus on regularizing the mean parameters that represent cluster centers and ignore dependencies among variables within clusters, which leads to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model that permits general covariance matrices and so takes various dependencies into account. At the same time, this approach shrinks both the means and the covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an EM algorithm that uses the graphical lasso (Friedman et al., 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
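As an illustration of the kind of objective the abstract describes, the following is a minimal sketch of a penalized Gaussian mixture log-likelihood. The exact penalty is not given in the abstract, so the form used here is an assumption: an L1 penalty on the cluster means plus an L1 penalty on the entries of each cluster's precision (inverse covariance) matrix, which is the quantity the graphical lasso penalizes. The function names and the tuning parameters `lam_mean` and `lam_prec` are hypothetical, and the sketch only evaluates the objective; it does not implement the paper's EM updates.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def log_gauss(x, mu, prec):
    """Log density of N(mu, prec^{-1}) at x, parametrized by the precision matrix."""
    p = len(x)
    d = [xi - mi for xi, mi in zip(x, mu)]
    quad = sum(d[i] * prec[i][j] * d[j] for i in range(p) for j in range(p))
    chol = cholesky(prec)
    log_det = 2.0 * sum(math.log(chol[i][i]) for i in range(p))
    return 0.5 * log_det - 0.5 * quad - 0.5 * p * math.log(2.0 * math.pi)


def penalized_loglik(data, weights, means, precisions, lam_mean, lam_prec):
    """Mixture log-likelihood minus L1 penalties (assumed form, not the paper's)."""
    loglik = 0.0
    for x in data:
        terms = [
            math.log(w) + log_gauss(x, m, prec)
            for w, m, prec in zip(weights, means, precisions)
        ]
        top = max(terms)  # log-sum-exp for numerical stability
        loglik += top + math.log(sum(math.exp(t - top) for t in terms))
    mean_pen = sum(abs(v) for m in means for v in m)
    prec_pen = sum(abs(v) for prec in precisions for row in prec for v in row)
    return loglik - lam_mean * mean_pen - lam_prec * prec_pen


# Toy usage: two clusters in two dimensions with identity precision matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
value = penalized_loglik(
    data=[[0.1, -0.2], [2.9, 3.1]],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [3.0, 3.0]],
    precisions=[identity, identity],
    lam_mean=0.1,
    lam_prec=0.1,
)
```

Under this assumed penalty, a variable whose mean is shrunk to zero in every cluster no longer helps separate the clusters through its mean, and zeros in a precision matrix correspond to conditionally independent variable pairs within that cluster. Working with the precision matrix directly keeps the density evaluation free of explicit matrix inversion.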

Figures

Fig 1. Simulated data from only one cluster (top panels) and from two clusters (bottom panels) with K = 2
Fig 2. Expression levels of gene pairs (HG613-HT613, M23197) and (X95735, M38591), and the corresponding clusters for the leukemia data
Fig 3. Expression levels of gene pairs (A2BP1, FMR1) and (APRT, SSBP2), and the corresponding clusters for the BOEC data

References

    1. Alaiya AA, et al. Molecular classification of borderline ovarian tumors using hierarchical cluster analysis of protein expression profiles. Int J Cancer. 2002;98:895–899.
    2. Baker SG, Kramer BS. Identifying genes that contribute most to good classification in microarray. BMC Bioinformatics. 2006;7:407.
    3. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics. 1993;49:803–821.
    4. Bardi E, Bobok I, Olah AV, Olah E, Kappelmayer J, Kiss C. Cystatin C is a suitable marker of glomerular function in children with cancer. Pediatr Nephrol. 2004;19:1145–1147.
    5. Carvalho CM, Scott JG. Objective Bayesian model selection in Gaussian graphical models. Biometrika. 2009;96:497–512.