PLoS One. 2023 May 26;18(5):e0286312.
doi: 10.1371/journal.pone.0286312. eCollection 2023.

Shape complexity in cluster analysis

Eduardo J Aguilar et al. PLoS One. 2023.

Abstract

In cluster analysis, a common first step is to scale the data with the aim of better partitioning them into clusters. Although many different techniques have been introduced to this end over the years, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help determine appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
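As a rough illustration of the two scaling schemes this record mentions (division by the standard deviation σk along each dimension, and the αk/σk scaling named in the caption of Fig 3), the sketch below may help. It is not the authors' code: the function names, the toy data, and the αk values are hypothetical placeholders, and the use of the population standard deviation is an assumption. The nonlinear program that the paper uses to choose the αk is not reproduced here.

```python
import math
import statistics


def scale_columns(samples, alphas=None):
    """Return samples with dimension k multiplied by alphas[k] / sigma_k.

    With alphas=None every alpha_k is 1, i.e. plain division by the
    standard deviation. Using the population standard deviation is an
    assumption of this sketch.
    """
    n_dims = len(samples[0])
    if alphas is None:
        alphas = [1.0] * n_dims
    if len(alphas) != n_dims:
        raise ValueError("need one alpha per dimension")
    # Standard deviation of each dimension over all samples.
    sigmas = [statistics.pstdev([row[k] for row in samples]) for k in range(n_dims)]
    if any(s == 0 for s in sigmas):
        raise ValueError("a dimension with zero spread cannot be scaled")
    return [
        tuple(a * x / s for a, x, s in zip(alphas, row, sigmas))
        for row in samples
    ]


def pairwise_distances(samples):
    """Euclidean distances r_ij between all distinct pairs of samples."""
    return [
        math.dist(a, b)
        for i, a in enumerate(samples)
        for b in samples[i + 1:]
    ]


# Hypothetical toy data: two dimensions on very different scales.
samples = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0), (4.0, 400.0)]

by_sigma = scale_columns(samples)              # scaling by 1/sigma_k
by_alpha = scale_columns(samples, [2.0, 0.5])  # alpha_k/sigma_k, placeholder alphas

print(pairwise_distances(by_sigma))
print(pairwise_distances(by_alpha))
```

Because a method such as k-means works directly on distances between samples, comparing the two printed distance lists shows how a change of scaling factors alone reshapes the input that the clustering step sees.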

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. ARIfnc versus AMImax for all partitions resulting from scaling as in the rightmost column of Table 2.
Fig 2. Results of the random trials with Problem P on Iris (A), BCW (B), BC-DR3 (C), BNA-DR3 (D), and BCW-Diag-10 (E), expanding on the summary given on the rightmost column of Table 2. Each point on each left panel corresponds to a trial and is color-coded according to the accompanying palette to reflect the value of ARIfnc it leads to by way of clustering with k-means. The point leading to the highest ARIfnc value is marked by the crosshair in the panel. Each right panel provides a view of how ARIfnc is distributed over all pertaining trials.
Fig 3. Reference partition for the Iris data set (leftmost column of panels) and the effects of two scaling schemes: Scaling by 1/σk (middle column) and scaling by αk/σk (rightmost column), with factors as in Table 3. Effects can be seen both with respect to the shape of the data set (top row of panels, all plots drawn to the same scale) and to the distribution of distances between samples (the rij’s; bottom row, all plots drawn to the same scale).
Fig 4. As in Fig 3, now for the BNA-DR3 data set.
