PLoS One. 2014 Apr 4;9(4):e91315. doi: 10.1371/journal.pone.0091315. eCollection 2014.

Parallel clustering algorithm for large-scale biological data sets


Minchao Wang et al. PLoS One.

Abstract

Background: The recent explosion of biological data brings a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much larger memory and longer runtimes. The affinity propagation algorithm outperforms many classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on pairwise similarities, a similarity matrix must be constructed before affinity propagation can run, and this construction itself takes a long time.

Methods: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, while a distributed system runs the affinity propagation algorithm, owing to its large aggregate memory and computing capacity. An appropriate data partition and reduction scheme is designed to minimize the global communication cost among processes.
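The algorithm being parallelized is standard affinity propagation (Frey and Dueck, 2007), which iterates responsibility and availability message updates over the similarity matrix. A minimal serial NumPy sketch of those updates, for reference (the function name, damping factor, and iteration count here are illustrative, not taken from the paper):

```python
import numpy as np

def affinity_propagation(S, damping=0.9, iters=200):
    """Serial affinity propagation on an n x n similarity matrix S.

    Returns the exemplar index chosen for each data point.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibility messages r(i, k)
    A = np.zeros((n, n))  # availability messages a(i, k)
    idx = np.arange(n)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]            # fancy indexing returns a copy
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]       # keep r(k,k) unclipped
        col = Rp.sum(axis=0)             # per-column sums over all rows
        A_new = col[None, :] - Rp
        diag = A_new[idx, idx].copy()    # a(k,k) is not clipped at 0
        A_new = np.minimum(0, A_new)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)      # exemplar for each point
```

The per-column sums in the availability update are the part the paper distributes across processes, since each column sum needs contributions from every row.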

Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm can handle large-scale data sets effectively. The parallel affinity propagation algorithm also performs well when clustering large-scale gene expression (microarray) data and detecting families in large protein superfamilies.


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Runtime and speedup of constructing the similarity matrix for different data partition schemes.
The x-axis represents the number of cores, and the y-axis represents the runtime and speedup.
Figure 2. Runtime and speedup of the parallel affinity propagation algorithm.
(a) shows the runtime on the five biological data sets. (b) shows the speedup on the cd40 and enolase data sets.
Figure 3. The F-measure score of some large clusters on two protein data sets.
The F-measure, recall, and precision scores are calculated. The x-axis and y-axis represent the cluster and its corresponding best score value, respectively. Some tiny and singleton clusters are not considered in the figure.
Figure 4. Partition of the input biological matrix.
Rows of the input matrix are assigned to different cores to compute their similarities with the other rows (each row represents a data point; N and D denote the number of rows of the input matrix and the dimension of each row, respectively; s(i, j) denotes the similarity between data points i and j; P denotes the number of available cores).
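Under this row partition, each core computes the similarity-matrix rows for its assigned data points. A sketch, assuming the common affinity-propagation choice of negative squared Euclidean distance as the similarity measure (the function name and the measure are assumptions, not specified by the caption):

```python
import numpy as np

def similarity_rows(X, my_rows):
    """Similarity-matrix rows for the data points assigned to one core.

    s(i, j) = -||x_i - x_j||^2 (negative squared Euclidean distance).
    X has shape (n, d); my_rows lists the row indices this core owns.
    """
    diffs = X[my_rows][:, None, :] - X[None, :, :]  # (len(my_rows), n, d)
    return -(diffs ** 2).sum(axis=2)
```

Stacking the outputs of all cores in row order reproduces the full n x n similarity matrix.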
Figure 5. The computing load on cores for different data partition schemes.
(a) The computing load of the whole data set on one core. The cores assigned the upper rows of the input matrix carry much more load than the cores assigned the lower rows. (b) The computing load on cores when the input matrix is partitioned by the sequence partition. (c) The computing load on cores when the input matrix is partitioned by the shutter partition.
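The imbalance arises if each row i only computes the similarities s(i, j) for j > i (the other half follows by symmetry), so upper rows cost more. A small sketch contrasting the per-core loads of the two partition schemes under that cost model (the function name and the exact cost model are illustrative assumptions):

```python
def per_core_load(n, P, scheme):
    """Pairs of similarities computed per core when row i costs n-1-i
    (only s(i, j) for j > i is computed; s(j, i) follows by symmetry)."""
    block = -(-n // P)                   # ceil(n / P), rows per block
    loads = [0] * P
    for i in range(n):
        if scheme == "sequence":         # contiguous blocks of rows
            p = min(i // block, P - 1)
        else:                            # "shutter": interleaved rows
            p = i % P
        loads[p] += n - 1 - i
    return loads
```

The interleaved ("shutter") assignment gives every core a mix of expensive upper rows and cheap lower rows, so its load spread is far smaller than that of contiguous blocks.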
Figure 6. Partition of the three information matrices.
All three matrices are partitioned by rows. For a matrix with N rows and a computing cluster with M machine nodes, each node is assigned about N/M rows of each information matrix. Within each node, its N/M rows are processed concurrently by the node's cores.
Figure 7. The procedure of computing the availability messages.
There are P processes P_0, ..., P_(P-1), and after partitioning each process is assigned about N/P rows of the availability-message matrix. Each row has N columns. To reduce the communication cost, each process first computes the local sums of the N columns over its own rows and stores these intermediate values in its Local Summation Array; the local values from all processes are then gathered and scattered to compute the global sums of the N columns.
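This reduction pattern can be sketched in serial NumPy: each "process" owns a slab of rows, computes per-column partial sums locally, and the partials are then combined, which is what the gather-and-scatter step achieves in the distributed code. The function name is illustrative, and the sketch shows only the communication pattern (real affinity propagation treats the diagonal entries specially, which is omitted here):

```python
import numpy as np

def global_column_sums(R, P):
    """Column sums of max(0, R) computed as P per-process partial sums.

    Each 'process' owns a contiguous slab of rows, builds its local
    summation array, and the arrays are then combined elementwise --
    the step done by the gather/scatter in the distributed version.
    """
    slabs = np.array_split(np.maximum(R, 0), P, axis=0)  # row partition
    local = [slab.sum(axis=0) for slab in slabs]         # local summation arrays
    return np.sum(local, axis=0)                         # global combine
```

Exchanging one length-N partial-sum array per process, rather than the full N/P x N slab, is what keeps the global communication cost low.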
