Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Gregoire Preud'homme¹,², Kevin Duarte¹, Kevin Dalleau³, Claire Lacomblez¹, Emmanuel Bresso³, Malika Smaïl-Tabbone²,³, Miguel Couceiro³, Marie-Dominique Devignes²,³, Masatake Kobayashi¹,², Olivier Huttin¹,², João Pedro Ferreira¹,², Faiez Zannad¹,², Patrick Rossignol¹,², Nicolas Girerd⁴,⁵,⁶

Affiliations

¹ Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France.
² F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
³ CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
⁴ Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France. n.girerd@chru-nancy.fr.
⁵ F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France. n.girerd@chru-nancy.fr.
⁶ Centre d'Investigation Clinique Pierre Drouin -INSERM - CHRU de Nancy, Institut Lorrain du cœur Et Des Vaisseaux Louis Mathieu, 4, Rue du Morvan, 54500, Vandœuvre-Lès-Nancy, France. n.girerd@chru-nancy.fr.

PMID: 33603019
PMCID: PMC7892576
DOI: 10.1038/s41598-021-83340-8

Gregoire Preud'homme et al. Sci Rep. 2021 Feb 18;11(1):4202. doi: 10.1038/s41598-021-83340-8.

Abstract

Choosing the most appropriate unsupervised machine-learning method for "heterogeneous" or "mixed" data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of "ready-to-use" tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes performed best in the setting of heterogeneous data.
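
For readers unfamiliar with the Adjusted Rand Index used above as the performance criterion, the following is a minimal sketch of the standard pair-counting (Hubert and Arabie) form of the index. It is written in plain Python for illustration only: the study itself relied on R tools, and the function name `adjusted_rand_index` and the toy label vectors are assumptions introduced here, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative helper (hypothetical name), based on the standard
    contingency-table formula; not the code used in the benchmark.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("Both label sequences must have the same length.")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare: treat as perfect agreement

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, or in each, partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2        # maximum attainable agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


# Toy example: simulated "true" clusters versus a relabelled, identical partition.
truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
ari_same = adjusted_rand_index(truth, relabelled)   # cluster names do not matter
ari_crossed = adjusted_rand_index(truth, crossed)   # worse than chance agreement
```

An ARI of 1 indicates that the recovered clusters match the simulated ones exactly (up to relabelling), values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.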

Conflict of interest statement

Pr. Rossignol reports grants and personal fees from AstraZeneca, Bayer, CVRx, Fresenius, and Novartis, and personal fees from Grunenthal, Servier, Stealth Peptides, Vifor Fresenius Medical Care Renal Pharma, Idorsia, NovoNordisk, Ablative Solutions, G3P, Corvidia, and Relypsa. Pr. Rossignol and Pr. Zannad are cofounders of CardioRenal. Pr. Zannad has received fees for serving on the board of Boston Scientific; consulting fees from Novartis, Takeda, AstraZeneca, Boehringer-Ingelheim, GE Healthcare, Relypsa, Servier, Boston Scientific, Bayer, Johnson & Johnson, and Resmed; and speakers' fees from Pfizer and AstraZeneca. Pr. Girerd reports personal fees from Novartis, AstraZeneca and Boehringer. All other authors declare no competing interests related to this paper.

Figures

**Figure 1**
Influence of population size and number of clusters on clustering performance in simulation studies. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).

**Figure 2**
Influence of characteristics of continuous and categorical variables on clustering performance in simulation studies. Top panels: scenario 3 applied to continuous (left) and categorical (right) variables; middle panels: scenarios 4 (left) and 5 (right); bottom panels: scenarios 6 (left) and 7 (right), as defined in Table 2. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).

**Figure 3**
Results obtained with the distance-based clustering algorithms in the EPHESUS dataset. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).

**Figure 4**
Results obtained with the model-based clustering algorithms in the EPHESUS dataset. Radar charts were created from the per-group means of scaled values (dichotomous and ordinal variables were coded numerically for convenience). The dot charts show the average Absolute Standardized Mean Difference (ASMD) over all clusters. This figure title was generated using R (R: A Language and Environment for Statistical Computing, R Core Team, R Foundation for Statistical Computing, Vienna, Austria, 2020, https://www.R-project.org).
