BMC Bioinformatics. 2013 Aug 15;14:248.

doi: 10.1186/1471-2105-14-248.

kClust: fast and sensitive clustering of large protein sequence databases

Maria Hauser¹, Christian E Mayer, Johannes Söding

Affiliation

¹ Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany. soeding@genzentrum.lmu.de.

PMID: 23945046
PMCID: PMC3843501
DOI: 10.1186/1471-2105-14-248

Maria Hauser et al. BMC Bioinformatics. 2013.

Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%-30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.

Figures

**Figure 1**
**k-mer matches comparison.** Comparison of exact 3-mer matches **(A)** versus similar 6-mer matches with r = 100 **(B)** between two proteins with 45% sequence identity.

**Figure 2**
**Prefiltering step in kClust.** Prefiltering algorithm: For each k-mer in the query (k = 6), a list of similar k-mers and their BLOSUM62 similarity scores is generated (blue frame). For each such k-mer (red), a pointer to a list of representative sequences containing this k-mer is looked up in an array (index table). The score S of each sequence in that list is increased by the similarity score. After all k-mers in the query have been processed, array S contains for each representative sequence the sum of k-mer similarity scores.
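The prefilter described in this caption can be illustrated with a minimal Python sketch. It is not the kClust implementation. Two simplifications are assumptions made here for brevity: a toy scoring scheme (`MATCH`, `MISMATCH`) stands in for BLOSUM62, and "similar" k-mers are limited to the k-mer itself plus all single-residue substitutions. The index also records each representative at most once per k-mer. All function names are hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: toy substitution scores standing in for BLOSUM62.
MATCH, MISMATCH = 4, -1


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(representatives, k):
    """Index table: k-mer -> set of representative ids containing it."""
    index = defaultdict(set)
    for rep_id, seq in representatives.items():
        for kmer in kmers(seq, k):
            index[kmer].add(rep_id)
    return index


def similar_kmers(kmer):
    """Yield (k-mer, score) for the k-mer itself and every single substitution.

    Assumption: real similar-k-mer lists are generated from a BLOSUM62
    score threshold, not from a fixed number of substitutions.
    """
    k = len(kmer)
    yield kmer, MATCH * k
    for pos, original in enumerate(kmer):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield kmer[:pos] + aa + kmer[pos + 1:], MATCH * (k - 1) + MISMATCH


def prefilter_scores(query, index, k=6):
    """Sum similar-k-mer scores per representative (the array S in the caption)."""
    scores = defaultdict(int)
    for kmer in kmers(query, k):
        for similar, score in similar_kmers(kmer):
            for rep_id in index.get(similar, ()):
                scores[rep_id] += score
    return dict(scores)


representatives = {"rep_identical": "ACDEFGHIK", "rep_one_sub": "ACDEFAHIK",
                   "rep_unrelated": "WWWWWWWWW"}
index = build_index(representatives, k=6)
scores = prefilter_scores("ACDEFGHIK", index, k=6)
```

Representatives whose summed score exceeds a chosen threshold would then be passed on to the more expensive alignment step; unrelated sequences never appear in `scores` at all, so they cost nothing beyond the index lookups.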

**Figure 3**
**Swapping.** Memory swapping procedure of kClust. See the text for an explanation.

**Figure 4**
**Iterative kClust.** Overview of the iterative kClust method. First, kClust clusters the initial sequence database. Then, multiple sequence alignments are generated, and profile and consensus sequences are computed for each cluster. Finally, profile-based kClust merges the clusters.

**Figure 5**
**Generation of spaced k-mers.** Generation of a list of spaced k-mers in the fast kClust prefilter algorithm (cf. Figure 2).

**Figure 6**
**Performance of HHblits on the clustered UniProtKB.** Fraction of queries with ROC5 value above the threshold on the x-axis, for one, two, and three HHblits iterations on the test set (5287 sequences from the SCOP 1.73 database). All but the last search iteration are performed against UniProt. The last search iteration is performed against a combined database containing the UniProt and the SCOP sequences. TPs are defined as pairs from the same SCOP fold, FPs as pairs from different folds, with the exception of Rossmann folds and β propellers. The ROC5 value is the area under the ROC curve up to the 5th FP, normalized to yield a theoretical maximum of 1. The ROC5 plot is more robust to overfitting than the ROC curves.
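The ROC5 definition in this caption can be made concrete with a short Python sketch. It follows one common way of computing a ROCn score and is not taken from the paper's evaluation code. The function name `roc_n` is hypothetical, and the handling of hit lists with fewer than n false positives (the missing FPs are counted as if they ranked after every listed hit) is an assumption.

```python
def roc_n(ranked_is_tp, total_positives, n=5):
    """Area under the ROC curve up to the n-th false positive, scaled to [0, 1].

    ranked_is_tp: booleans for hits sorted by decreasing score
                  (True = true positive, False = false positive).
    total_positives: number of true positives that could have been found.
    """
    if total_positives <= 0 or n <= 0:
        raise ValueError("total_positives and n must be positive")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # TPs ranked above this false positive
            if fp_seen == n:
                break
    # Assumption: if the list ends before the n-th FP, the remaining FPs
    # are treated as ranked below all listed hits.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_positives)


perfect = roc_n([True, True, True] + [False] * 5, total_positives=3)
worst = roc_n([False] * 5 + [True] * 3, total_positives=3)
mixed = roc_n([True, True, False, True, False, False, False, False],
              total_positives=3)
```

A ranking that lists every true positive before the first false positive reaches the theoretical maximum of 1, which is the normalization the caption refers to.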

**Figure 7**
**kClust running time vs. clustering threshold.** Dependence of kClust running time on the sequence identity threshold, calculated on SwissProt.

**Figure 8**
**Memory consumption of kClust and UCLUST for different database sizes.**
