Sci Rep. 2021 Feb 18;11(1):4202.
doi: 10.1038/s41598-021-83340-8.

Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Gregoire Preud'homme et al. Sci Rep. 2021.

Abstract

The choice of the most appropriate unsupervised machine-learning method for "heterogeneous" or "mixed" data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of "ready-to-use" tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
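
The abstract scores each clustering method by the Adjusted Rand Index (ARI) against the known simulated cluster labels. As an illustration only, the sketch below computes the standard pair-counting ARI in plain Python. It is not the authors' code (the benchmark was run with R tools), and the function name `adjusted_rand_index` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI between two partitions (illustrative sketch)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two observations: the partitions trivially agree.
        return 1.0

    # Contingency table counts and its row/column margins.
    cell_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    # Number of pairs placed together in both, in the first, in the second.
    index = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (index - expected) / (max_index - expected)


# Toy labels (hypothetical): the same partition under different label names.
truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
print(adjusted_rand_index(truth, relabelled))
```

An ARI of 1 means the two partitions are identical up to relabelling, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.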

Conflict of interest statement

Pr Rossignol reports grants and personal fees from AstraZeneca, Bayer, CVRx, Fresenius, and Novartis, and personal fees from Grunenthal, Servier, Stealth Peptides, Vifor Fresenius Medical Care Renal Pharma, Idorsia, NovoNordisk, Ablative Solutions, G3P, Corvidia, and Relypsa. Pr Rossignol and Pr Zannad are the cofounders of CardioRenal. Pr Zannad has received fees for serving on the board of Boston Scientific; consulting fees from Novartis, Takeda, AstraZeneca, Boehringer-Ingelheim, GE Healthcare, Relypsa, Servier, Boston Scientific, Bayer, Johnson & Johnson, and Resmed; and speakers' fees from Pfizer and AstraZeneca. Pr Girerd reports personal fees from Novartis, AstraZeneca and Boehringer. All other authors declare no competing interest related to this paper.

Figures

Figure 1
Influence of population size and number of clusters on clustering performance in simulation studies. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).
Figure 2
Influence of characteristics of continuous and categorical variables on clustering performance in simulation studies. Top panels: scenario 3 applied to continuous (left) and categorical (right) variables; middle panels: scenarios 4 (left) and 5 (right); bottom panels: scenarios 6 (left) and 7 (right), as defined in Table 2. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).
Figure 3
Results obtained with the distance-based clustering algorithms in the EPHESUS dataset. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).
Figure 4
Results obtained with the model-based clustering algorithms in the EPHESUS dataset. Radar charts were created from the per-group means of scaled values (dichotomous and ordinal variables were coded numerically for convenience). The dot charts show the average Absolute Standardized Mean Difference (ASMD) over all clusters. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).
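
The caption above refers to an average Absolute Standardized Mean Difference (ASMD) over all clusters but does not give the formula. The sketch below shows one common way such a quantity could be computed for a single continuous variable. It assumes that the standardization uses the square root of the mean of the two group variances and that the average is taken over all pairs of clusters. Both choices, along with the function names `asmd` and `average_asmd` and the toy data, are assumptions for illustration and may differ from what the authors did.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def asmd(x, y):
    """Absolute standardized mean difference between two groups.

    Assumes pooled SD = sqrt((var_x + var_y) / 2); each group needs
    at least two observations so the sample variance is defined.
    """
    pooled_sd = sqrt((variance(x) + variance(y)) / 2)
    diff = abs(mean(x) - mean(y))
    if pooled_sd == 0:
        # Constant values in both groups: no spread to standardize by.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd


def average_asmd(values, labels):
    """Average ASMD over all pairs of clusters (assumed aggregation)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    if len(groups) < 2:
        raise ValueError("at least two clusters are required")
    return mean(asmd(a, b) for a, b in combinations(groups.values(), 2))


# Toy data (hypothetical): one variable measured in two clusters.
values = [1, 2, 3, 3, 4, 5]
labels = ["a", "a", "a", "b", "b", "b"]
print(average_asmd(values, labels))
```

Under these assumptions, a larger average ASMD indicates that the clusters are more clearly separated on that variable.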
