A General Iterative Clustering Algorithm

Ziqiang Lin et al. Stat Anal Data Min. 2022 Aug;15(4):433-446.
doi: 10.1002/sam.11573. Epub 2022 Jan 31.

Abstract

The quality of a cluster analysis of unlabeled units depends on the quality of the between-unit dissimilarity measures. Data-dependent dissimilarity is more objective than data-independent geometric measures such as Euclidean distance. As suggested by Breiman, many data-driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. An RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program, and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Clustering (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF, yielding an updated proximity matrix that is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures, such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. Silhouette scores are substantially superior to those of the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.

Keywords: Clustering; Extremely randomized tree; Proximity; Random forest; iterative RF clustering.


Conflict of interest statement

CONFLICT OF INTEREST The authors have no conflicts to disclose.

Figures

FIGURE 1. Scatterplots of petal length versus petal width for the iris data, for ground truth and 4 clustering methods. Ground truth clusters (setosa, versicolor, and virginica) are shown in the upper left plot.
FIGURE 2. Scatterplots of maximum heart rate achieved versus exercise-induced ST depression for the heart disease data, for ground truth and 4 clustering methods.
FIGURE 3. Silhouette scores over the range of possible cluster numbers for the iris data. Note that the range of the ordinates in the two plots is not the same. The maximum for RFC is 6 and for IRFC is 2.
FIGURE 4. Scatterplots of petal length versus petal width for the iris data for the numbers of clusters that maximized the Silhouette scores: 6 for RFC and 2 for IRFC. Ground truth with 3 clusters is shown in the upper left plot of Figure 1.
FIGURE 5. Scatterplots of petal length versus petal width for the iris data for different initial labeling strategies, all assuming 3 clusters. Ground truth clusters are shown in the upper left.

