Distance-based clustering challenges for unbiased benchmarking studies

Michael C Thrun^{1,2}

Affiliations

¹ DataBionics AG, Mathematics and Computer Science, The University of Marburg, Hans-Meerwein Str, 35032, Marburg, Germany. m.thrun@mathematik.uni.marburg.de.
² Department of Hematology, Oncology and Immunology, Philipps-Universität Marburg, Hans-Meerwein-Straße 6, 04A28, 35032, Marburg, Germany. m.thrun@mathematik.uni.marburg.de.

PMID: 34556686
PMCID: PMC8460803
DOI: 10.1038/s41598-021-98126-1

Michael C Thrun. Sci Rep. 2021 Sep 23;11(1):18988. doi: 10.1038/s41598-021-98126-1.

Erratum in

Publisher Correction: Distance-based clustering challenges for unbiased benchmarking studies.
Thrun MC. Sci Rep. 2021 Oct 6;11(1):20245. doi: 10.1038/s41598-021-99687-x. PMID: 34615989.

Abstract

Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their ability to recover clusters that define distance-based structures, which results in biased clustering solutions. Datasets might not have cluster structures at all. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. The clusters can be recovered only if the predefined structures happen to meet the particular clustering criterion and QM. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
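The claim that an unsupervised quality measure can favour the wrong partition can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy data (two nested rings), the half-plane split that stands in for the kind of result a centroid-based method might return, and all function names. The Davies–Bouldin index is written out in its standard form (mean distance to the centroid as scatter, Euclidean distance between centroids), using only the Python standard library.

```python
import math


def ring(radius, n, cx=0.0):
    """n evenly spaced 2-D points on a circle of given radius centred at (cx, 0)."""
    return [(cx + radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index with Euclidean distances (lower = 'better')."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents, scatter = {}, {}
    for lab, pts in clusters.items():
        c = (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        cents[lab] = c
        # Scatter: mean distance of the cluster's points to its centroid.
        scatter[lab] = sum(math.dist(p, c) for p in pts) / len(pts)
    # For each cluster, take the worst similarity ratio to any other cluster.
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i) for i in clusters]
    return sum(worst) / len(worst)


def agreement(truth, labels):
    """Share of points labelled like the ground truth (two clusters, up to label swap)."""
    same = sum(t == l for t, l in zip(truth, labels)) / len(truth)
    return max(same, 1.0 - same)


# Hypothetical toy data: an outer ring and a slightly off-centre inner ring.
points = ring(1.0, 100) + ring(0.3, 100, cx=0.1)
truth = [0] * 100 + [1] * 100                    # ring membership (ground truth)
split = [0 if x < 0 else 1 for x, _ in points]   # compact left/right halves

db_rings = davies_bouldin(points, truth)
db_split = davies_bouldin(points, split)
acc_split = agreement(truth, split)

print(f"Davies-Bouldin, true rings:  {db_rings:.2f}")
print(f"Davies-Bouldin, half split:  {db_split:.2f}")
print(f"Agreement of half split with ground truth: {acc_split:.2f}")
```

In this toy setting the index rates the half-plane split as better (lower) than the true ring partition, although the split agrees with the ground truth only slightly above chance. The rings are a distance-based structure that does not match the compactness criterion the index encodes.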

Conflict of interest statement

The author declares no competing interests.

Figures

**Figure 1**
The coloured points of the two SOM clusters of the GolfBall dataset. The figure on the left shows an optimal clustering of 0.83 for the Davies–Bouldin index, and the figure on the right shows the worst case of 11.8 for the Davies–Bouldin index.

**Figure 2**
MD-plots of the micro-averaged F1 score (left) and Davies–Bouldin index (right) across 120 trials for 33 clustering algorithms calculated on the leukaemia dataset. Distance-based structures with imbalanced classes are not easy to tackle in high-dimensional data. The chance level is shown by the dotted line at 50%. Choosing an algorithm by the Davies–Bouldin index would lead to the selection of the CentroidL or, for some trials, the VarSelLCM algorithm, whereas using the ground truth shows that AverageL, CompleteL, DBS, Diana, SingleL and WPGMA are appropriate algorithms to reproduce the high-dimensional structures with low variance and bias. The results for Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RTC, and Spectrum could not be computed. Note that Markov clustering results in only one cluster, in which case the Davies–Bouldin index is not defined.
