Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing

Tobias Wittkop¹, Jan Baumbach, Francisco P Lobo, Sven Rahmann

Affiliations

PMID: 17941985
PMCID: PMC2147039
DOI: 10.1186/1471-2105-8-396

Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing

Tobias Wittkop et al. BMC Bioinformatics. 2007.

. 2007 Oct 17:8:396.

doi: 10.1186/1471-2105-8-396.

Authors

Tobias Wittkop¹, Jan Baumbach, Francisco P Lobo, Sven Rahmann

Affiliation

¹ Computational Methods for Emerging Technologies, Bielefeld University, Bielefeld, Germany. tobias.wittkop@cebitec.uni-bielefeld.de

PMID: 17941985
PMCID: PMC2147039
DOI: 10.1186/1471-2105-8-396

Abstract

Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that has to be efficiently clustered with constant or increased accuracy, at increased speed.

Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection is well-suited to protein clustering. We present the FORCE heuristic that is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.

Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.

PubMed Disclaimer

Figures

**Figure 1**
Graphical summary of the obtained clustering results of FORCE for the two similarity functions (A) BeH and (B) SoH, and dataset ASTRAL95_1_161. We used MATLAB scripts provided by Paccanaro to create images similar to those of Figure 3 in [7]. Each row corresponds to a cluster. Green bars represent a protein assignment to a cluster; each protein is present in only one of the clusters. Boundaries between superfamilies are shown by vertical red lines, and boundaries between families within each superfamily are shown by dotted blue lines.

**Figure 2**
Comparison of the running times of FORCE against the exact fixed-parameter algorithm described in [2]. Plotted is the running time (y-axis in seconds) for different graph sizes (x-axis). Solely for visualization purposes, we describe the size of a graph on the x-axis as |V|·|E|. All graphs have been constructed from prokaryotic COG protein sequence comparisons using BeH as scoring function. Note that both axes are scaled logarithmically. The red points correspond to FORCE running times, and the blue points to the FP algorithm, respectively.

**Figure 3**
Relative cost deviations (y-axis in %) of the FORCE solutions from the optimal solutions found by the exact fixed-parameter algorithm described in [2]. The x-axis is as in Figure 2 (logarithmically scaled).

See this image and copyright information in PMC

References

1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
1. Rahmann S, Wittkop T, Baumbach J, Martin M, Truß A, Böcker S. Exact and Heuristic Algorithms for Weighted Cluster Editing. Comput Syst Bioinformatics Conf. 2007;6:391–401. - PubMed
1. Delvaux S, Horsten L. On best transitive approximations of simple graphs. Acta informatica. 2004;40:637–655.
1. Shamir R, Sharan R, Tsur D. Cluster graph modification problems. Discrete Applied Mathematics. 2004;144:173–182.
1. Pipenbacher P, Schliep A, Schneckener S, Schoenhuth A, Schomburg D, Schrader R. ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics. 2002;18:S182–S191. - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing

Affiliation

Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing

Authors

Affiliation

Abstract

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources