Examining distributional characteristics of clusters
PMID: 20653176
Abstract
Standard cluster analysis creates clusters based on the criterion that their members be closer to each other than to members of other clusters. In this article, it is proposed to examine empirical clusters that result from standard clustering, with the goal of assessing whether they contradict distributional assumptions. Four models are proposed. The models combine two data generation processes, the Poisson and the multinormal, with two convex shapes of cluster hulls, the spherical and the ellipsoidal. Based on the chosen model, the probability of being in a cluster of a given location, size, and shape is estimated. This probability is compared with the observed proportion of cases. The observed proportion can turn out to be larger than, as large as, or smaller than expected. Examples are given using simulated and empirical data. The simulation showed that the size of a cluster, the data generation process, and the true distribution of data have the strongest effect on the results obtained with the proposed method. The empirical examples discuss distributional characteristics of cross-sectional and longitudinal clusters of aggressive behavior in adolescents. The examples show that clustering methods do not always yield clusters that contradict distributional assumptions. Some clusters contain even fewer cases than expected.
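The comparison described in the abstract can be illustrated with a minimal sketch for spherical hulls. It is not the authors' implementation: the function names, the standardized multinormal with independent variables, the rectangular data region for the Poisson-type model, and the normal approximation used to compare observed and expected counts are all assumptions made for illustration.

```python
import math
import random


def sphere_volume(radius, dim):
    """Volume of a ball of the given radius in dim dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def uniform_sphere_probability(radius, bounds):
    """Expected share of cases in a spherical hull under a homogeneous
    (Poisson-type) process on a rectangular region.

    Assumption: the sphere lies entirely inside the region given by
    bounds, a list of (low, high) pairs, one per variable.
    """
    box_volume = math.prod(high - low for low, high in bounds)
    return sphere_volume(radius, len(bounds)) / box_volume


def normal_sphere_probability(center, radius, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the probability that a standardized
    multinormal case (independent variables, an assumption) falls
    inside the sphere with the given center and radius."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        squared_distance = sum((rng.gauss(0.0, 1.0) - c) ** 2 for c in center)
        if squared_distance <= radius ** 2:
            hits += 1
    return hits / n_draws


def z_score(observed_count, n_cases, expected_probability):
    """Normal approximation to the binomial (an assumed test choice).
    Positive values mean more cases than expected, negative values fewer."""
    expected = n_cases * expected_probability
    spread = math.sqrt(n_cases * expected_probability * (1 - expected_probability))
    return (observed_count - expected) / spread


if __name__ == "__main__":
    # Hypothetical cluster: 2 variables, center (1.5, 1.5), radius 0.5,
    # containing 30 of 200 cases.
    p_normal = normal_sphere_probability((1.5, 1.5), 0.5)
    p_uniform = uniform_sphere_probability(0.5, [(-3, 3), (-3, 3)])
    print("multinormal model z:", z_score(30, 200, p_normal))
    print("Poisson-type model z:", z_score(30, 200, p_uniform))
```

Under either model, a clearly positive score suggests the cluster holds more cases than the distributional assumption predicts, and a clearly negative score suggests fewer. An ellipsoidal hull would need a different volume formula and a different inside-the-hull check.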