PLoS One. 2014 Apr 4;9(4):e91315. doi: 10.1371/journal.pone.0091315. eCollection 2014.

Parallel clustering algorithm for large-scale biological data sets


Minchao Wang et al. PLoS One.

Abstract

Background: The recent explosion of biological data brings a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much larger memory and longer runtimes. The affinity propagation algorithm outperforms many classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on pairwise similarities, a similarity matrix must be constructed before affinity propagation can run, and this construction itself takes a long time.

Methods: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, while a distributed system runs the affinity propagation algorithm, owing to its large aggregate memory and computing capacity. An appropriate data partition and reduction scheme is designed to minimize the global communication cost among processes.
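The algorithm being parallelized is standard affinity propagation (Frey and Dueck, 2007), which iterates responsibility and availability message updates over the similarity matrix. A minimal serial NumPy sketch of those updates, for reference (the function name, damping factor, and iteration count here are illustrative, not taken from the paper):

```python
import numpy as np

def affinity_propagation(S, damping=0.9, iters=200):
    """Serial affinity propagation on an n x n similarity matrix S.

    Returns the exemplar index chosen for each data point.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibility messages r(i, k)
    A = np.zeros((n, n))  # availability messages a(i, k)
    idx = np.arange(n)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]            # fancy indexing returns a copy
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]       # keep r(k,k) unclipped
        col = Rp.sum(axis=0)             # per-column sums over all rows
        A_new = col[None, :] - Rp
        diag = A_new[idx, idx].copy()    # a(k,k) is not clipped at 0
        A_new = np.minimum(0, A_new)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)      # exemplar for each point
```

The per-column sums in the availability update are the part the paper distributes across processes, since each column sum needs contributions from every row.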

Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm can handle large-scale data sets effectively. The parallel affinity propagation algorithm also performs well when clustering large-scale gene expression (microarray) data and detecting families in large protein superfamilies.


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Runtime and speedup of constructing the similarity matrix for different data partition schemes.
The x-axis represents the number of cores, and the y-axis represents the runtime and speedup.
Figure 2. Runtime and speedup of the parallel affinity propagation algorithm.
(a) shows the runtime on the five biological data sets. (b) shows the speedup on the cd40 and enolase data sets.
Figure 3. The F-measure score of some large clusters on two protein data sets.
The F-measure, recall, and precision scores are calculated. The x-axis and y-axis represent the cluster and its corresponding best score value, respectively. Some tiny and singleton clusters are not considered in the figure.
Figure 4. Partition of the input biological matrix.
Rows of the input matrix are assigned to different cores to compute their similarities with the other rows (each row represents a data point; N and D denote the number of rows of the input matrix and the dimension of each row, respectively; s(i, j) denotes the similarity between data points i and j; P denotes the number of available cores).
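Under this row partition, each core computes the similarity-matrix rows for its assigned data points. A sketch, assuming the common affinity-propagation choice of negative squared Euclidean distance as the similarity measure (the function name and the measure are assumptions, not specified by the caption):

```python
import numpy as np

def similarity_rows(X, my_rows):
    """Similarity-matrix rows for the data points assigned to one core.

    s(i, j) = -||x_i - x_j||^2 (negative squared Euclidean distance).
    X has shape (n, d); my_rows lists the row indices this core owns.
    """
    diffs = X[my_rows][:, None, :] - X[None, :, :]  # (len(my_rows), n, d)
    return -(diffs ** 2).sum(axis=2)
```

Stacking the outputs of all cores in row order reproduces the full n x n similarity matrix.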
Figure 5. The computing load on cores for different data partition schemes.
(a) The computing load of the whole data set on one core. The cores assigned the upper rows of the input matrix carry much more load than the cores assigned the lower rows. (b) The computing load on cores when the input matrix is partitioned by the sequence partition. (c) The computing load on cores when the input matrix is partitioned by the shutter partition.
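The imbalance arises if each row i only computes the similarities s(i, j) for j > i (the other half follows by symmetry), so upper rows cost more. A small sketch contrasting the per-core loads of the two partition schemes under that cost model (the function name and the exact cost model are illustrative assumptions):

```python
def per_core_load(n, P, scheme):
    """Pairs of similarities computed per core when row i costs n-1-i
    (only s(i, j) for j > i is computed; s(j, i) follows by symmetry)."""
    block = -(-n // P)                   # ceil(n / P), rows per block
    loads = [0] * P
    for i in range(n):
        if scheme == "sequence":         # contiguous blocks of rows
            p = min(i // block, P - 1)
        else:                            # "shutter": interleaved rows
            p = i % P
        loads[p] += n - 1 - i
    return loads
```

The interleaved ("shutter") assignment gives every core a mix of expensive upper rows and cheap lower rows, so its load spread is far smaller than that of contiguous blocks.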
Figure 6. Partition of the three information matrices.
All three matrices are partitioned by rows. For a matrix with N rows and a computing cluster with M machine nodes, each node is assigned about N/M rows of each information matrix. Within each node, its N/M rows are processed concurrently by the node's cores.
Figure 7. The procedure of computing the availability messages.
There are P processes P_0, ..., P_(P-1), and after partitioning each process is assigned about N/P rows of the availability-message matrix. Each row has N columns. To reduce the communication cost, each process first computes the local sums of the N columns over its own rows and stores these intermediate values in its Local Summation Array; the local values from all processes are then gathered and scattered to compute the global sums of the N columns.
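This reduction pattern can be sketched in serial NumPy: each "process" owns a slab of rows, computes per-column partial sums locally, and the partials are then combined, which is what the gather-and-scatter step achieves in the distributed code. The function name is illustrative, and the sketch shows only the communication pattern (real affinity propagation treats the diagonal entries specially, which is omitted here):

```python
import numpy as np

def global_column_sums(R, P):
    """Column sums of max(0, R) computed as P per-process partial sums.

    Each 'process' owns a contiguous slab of rows, builds its local
    summation array, and the arrays are then combined elementwise --
    the step done by the gather/scatter in the distributed version.
    """
    slabs = np.array_split(np.maximum(R, 0), P, axis=0)  # row partition
    local = [slab.sum(axis=0) for slab in slabs]         # local summation arrays
    return np.sum(local, axis=0)                         # global combine
```

Exchanging one length-N partial-sum array per process, rather than the full N/P x N slab, is what keeps the global communication cost low.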
