Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Aug 15:14:248.
doi: 10.1186/1471-2105-14-248.

kClust: fast and sensitive clustering of large protein sequence databases

Affiliations

kClust: fast and sensitive clustering of large protein sequence databases

Maria Hauser et al. BMC Bioinformatics. .

Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%-30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.

PubMed Disclaimer

Figures

Figure 1
Figure 1
k-mer matches comparison. Comparison of exact 3-mer matches (A) versus similar 6-mer matches with r = 100 (B) between two proteins with 45% sequence identity.
Figure 2
Figure 2
Prefiltering step in kClust. Prefiltering algorithm: For each k-mer in the query (k = 6), a list of similar k-mers and their BLOSUM62 similarity scores is generated (blue frame). For each such k-mer (red), a pointer to a list of representative sequences containing this k-mer is looked up in an array (index table). The score S of each sequence in that list is increased by the similarity score. After all k-mers in the query have been processed, array S contains for each representative sequence the sum of k-mer similarity scores.
Figure 3
Figure 3
Swapping. Memory swapping procedure of kClust. Explanation see text.
Figure 4
Figure 4
Iterative kClust. Overview over the iterative kClust method. First, kClust clusters the initial sequence database. Then, multiple sequence alignments are generated and profile and consensus sequences are computed for each cluster. Finally, profile-based kClust merges the clusters.
Figure 5
Figure 5
Generation of spacedk-mers. Spaced k-mers: The figure illustrates the generation of a list of spaced k-mers in the fast kClust prefilter algorithm (cf. Figure 2).
Figure 6
Figure 6
Performance of HHblits on the clustered UniProtKB. Fraction of queries with ROC5 value above the threshold on the x-axis, for one, two, and three HHblits iterations on the test set (5287 sequences from the SCOP 1.73 database). All but the last search iteration are performed against the UniProt. The last search iteration is done through a combined database containing the UniProt and the SCOP sequences. TPs are defined as pairs from the same SCOP folds, FPs as pairs from different folds, with the exception of Rossman folds and β propellers. The ROC5 value is the area under the ROC curve up to the 5th FP, normalized to yield a theoretical maximum of 1. The ROC5 plot is more robust to overfitting than the ROC curves.
Figure 7
Figure 7
kClust running time vs. clustering threshold. kClust running times dependency on sequence identity threshold, calculated on SwissProt.
Figure 8
Figure 8
Memory consumption of kClust and UCLUST for different database sizes.

References

    1. Chubb D, Jefferys BR, Sternberg MJE, Kelley LA. Sequencing delivers diminishing returns for homology detection: implications for mapping the protein universe. Bioinformatics. 2010;26(21):2664–2671. [ http://bioinformatics.oxfordjournals.org/content/26/21/2664.abstract] - PubMed
    1. Li W, Jaroszewski L, Godzik A. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng. 2002;15(8):643–649. [ http://view.ncbi.nlm.nih.gov/pubmed/12364578] - PubMed
    1. Park J, Holm L, Heger A, Chothia C. RSDB: representative protein sequence databases have high information content. Bioinformatics. 2000;16(5):458–464. [ http://view.ncbi.nlm.nih.gov/pubmed/10871268] - PubMed
    1. Suzek B, Huang H, McGarvey P, Mazumder R, Wu C. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–1288. - PubMed
    1. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers Y, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S. et al.The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77. [ http://dx.doi.org/10.1371/journal.pbio.0050077] - DOI - PMC - PubMed

Publication types