PLoS One. 2012;7(12):e49946.
doi: 10.1371/journal.pone.0049946. Epub 2012 Dec 11.

Reducing the time requirement of k-means algorithm


Victor Chukwudi Osamor et al. PLoS One. 2012.

Abstract

Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which contain a large number of points with a large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of which are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against clusters of a known structure using the Hubert-Arabie Adjusted Rand Index (ARI_HA). We found that when k is close to d, the quality is good (ARI_HA > 0.8), and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets.
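
For reference, the following is a minimal Python sketch of the traditional (Lloyd-style) k-means baseline described above: points in R^d are repeatedly assigned to their nearest center and the centers are recomputed, so as to reduce the mean squared distance from each point to its nearest center. This is not the authors' MMk-means; it is the standard algorithm whose time requirement the paper reduces, and the function name and parameters are illustrative only.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """X: (n, d) array of data points. Returns (centers, labels, mse)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]         # random initial centers
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                             # nearest-center assignment
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):                  # centers stopped moving
            break
        centers = new_centers
    mse = d2[np.arange(n), labels].mean()                      # the objective being minimized
    return centers, labels, mse

For a microarray expression matrix, X would typically have one row per gene (n rows) and one column per condition or time point (d columns), and the returned centers are k representative expression profiles.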


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Pseudocode of our Compute_MM Sub-program for MMk-means.
We create a covariance matrix by computing the Pearson product-moment correlation coefficient between the k centroids of the previous and current iterations, and then deduce the eigenvalues for the previous and current iterations. The difference between these eigenvalues is computed for each cluster and checked to see whether it satisfies the Ding-He interval.
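
To make the caption concrete, here is a rough Python illustration of the kind of stability test it describes: Pearson product-moment correlation matrices are built over the previous and current centroids, their eigenvalues are compared, and the per-cluster difference is checked against a bound. This is an interpretation, not a reproduction of the authors' Compute_MM pseudocode; in particular, ding_he_bound is a hypothetical placeholder for the Ding-He interval given in the paper.

import numpy as np

def compute_mm(prev_centroids, curr_centroids, ding_he_bound):
    """prev_centroids, curr_centroids: (k, d) arrays of the k centroids from the
    previous and current iterations. Returns one boolean per cluster, True when
    the eigenvalue change is small enough to call that cluster stable."""
    # Pearson product-moment correlation matrices over the k centroids, shape (k, k).
    corr_prev = np.corrcoef(prev_centroids)
    corr_curr = np.corrcoef(curr_centroids)
    # Eigenvalues for the previous and current iterations (symmetric matrices).
    eig_prev = np.linalg.eigvalsh(corr_prev)
    eig_curr = np.linalg.eigvalsh(corr_curr)
    # Per-cluster difference of eigenvalues, checked against the placeholder bound.
    diff = np.abs(eig_curr - eig_prev)
    return diff <= ding_he_bound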
Figure 2. Pseudocode of our main program for MMk-means.
It runs similarly to the traditional k-means, except that it is equipped with a metric-matrices-based mechanism to determine when a cluster is stable (that is, when its members will not move from the cluster in subsequent iterations). This mechanism is implemented in the sub-procedure Compute_MM of Figure 1. We use the theory developed by Zha et al. on the singular values of the matrix X of the input data points to determine when it is appropriate to execute Compute_MM during the k-means iterations. This is implemented in lines 34–40.
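
The control flow the caption describes might be sketched as below, reusing the compute_mm sketch from Figure 1. This is a hedged approximation, not the authors' program: the trigger based on n_dominant is a hypothetical stand-in for the Zha et al. singular-value criterion of lines 34–40, ding_he_bound is again a placeholder, and stable clusters are simply skipped when centers are recomputed.

import numpy as np

def mm_kmeans_sketch(X, k, n_iter=100, seed=0, ding_he_bound=1e-3):
    """Illustrative driver loop only; the check trigger and bound are placeholders."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    stable = np.zeros(k, dtype=bool)                 # clusters already flagged stable
    sv = np.linalg.svd(X, compute_uv=False)          # singular values of the data matrix
    n_dominant = int(np.sum(sv > 0.1 * sv.max()))    # crude count of dominant singular values
    for it in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                   # nearest-center assignment
        new_centers = centers.copy()
        for j in range(k):
            if not stable[j] and np.any(labels == j):   # skip recomputation for stable clusters
                new_centers[j] = X[labels == j].mean(axis=0)
        if it >= n_dominant:                         # hypothetical stand-in for the Zha et al. trigger
            stable |= compute_mm(centers, new_centers, ding_he_bound)
        converged = stable.all() or np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, labels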
Figure 3. Quality of Clusters (Bozdech et al., P.f 3D7 Microarray Dataset).
The qualities of the clusters produced by the four algorithms are similar. The MSE decreases gradually as the number of clusters increases, except for k = 21, which has a higher MSE than k = 20.
Figure 4. Execution Time (Bozdech et al., P.f 3D7 Microarray Dataset).
The plot shows that our MMk-means has the fastest run-time for the tested numbers of clusters, 15 ≤ k ≤ 25. Comparatively, k = 20 took the longest run-time for all four algorithms, implying that this is a function of the nature of the data under consideration.

References

    1. Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Research 9: 1106–1115.
    2. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907–2912.
    3. MacQueen J (1967) Some Methods for Classification and Analysis of Multivariate Observations. 5th Berkeley Symp Math Statist Prob 1: 281–297.
    4. Lloyd SP (1957) Least squares quantization in PCM. Bell Laboratories Internal Technical Report; IEEE Trans on Information Theory.
    5. Hamerly G, Elkan C (2003) Learning the k in k-means. In Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems (NIPS). Available at http://www.citeseer.ist.psu.edu/hamerly03learning.html
