A General Iterative Clustering Algorithm

Ziqiang Lin et al. Stat Anal Data Min. 2022 Aug;15(4):433-446.
doi: 10.1002/sam.11573. Epub 2022 Jan 31.

Abstract

The quality of a cluster analysis of unlabeled units depends on the quality of the between-unit dissimilarity measures. Data-dependent dissimilarity is more objective than data-independent geometric measures such as Euclidean distance. As suggested by Breiman, many data-driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. An RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program, and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Clustering (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF, yielding an updated proximity matrix that is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures, such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. Silhouette scores are substantially superior to those of the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.

Keywords: Clustering; Extremely randomized tree; Proximity; Random forest; iterative RF clustering.


Conflict of interest statement

CONFLICT OF INTEREST The authors have no conflicts to disclose.

Figures

FIGURE 1. Scatterplots of petal length versus petal width for the iris data, for ground truth and 4 clustering methods. Ground truth clusters (setosa, versicolor, and virginica) are shown in the upper left plot.
FIGURE 2. Scatterplots of maximum heart rate achieved versus exercise-induced ST depression for the heart disease data, for ground truth and 4 clustering methods.
FIGURE 3. Silhouette scores over the range of possible cluster numbers for the iris data. Note that the range of the ordinates in the two plots is not the same. The maximum for RFC is 6 and for IRFC is 2.
FIGURE 4. Scatterplots of petal length versus petal width for the iris data for the numbers of clusters that maximized the Silhouette scores: 6 for RFC and 2 for IRFC. Ground truth with 3 clusters is shown in the upper left plot of Figure 1.
FIGURE 5. Scatterplots of petal length versus petal width for the iris data for different initial labeling strategies, all assuming 3 clusters. Ground truth clusters are shown in the upper left.

