Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
- PMID: 16731699
- DOI: 10.1093/bioinformatics/btl158
Abstract
In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
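The abstract separates two tasks: clustering a single dataset (cd-hit, cd-hit-est) and comparing two datasets (cd-hit-2d, cd-hit-est-2d). It does not spell out the algorithm, so the Python sketch below is only an illustration under stated assumptions, not the cd-hit implementation. It assumes a greedy, longest-first scheme and substitutes a toy position-by-position identity score for real alignment. The names identity, greedy_cluster and compare_two_sets are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy score (assumption): matching positions divided by the shorter length.

    No alignment is performed; a real tool would align the sequences.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Cluster one dataset: visit sequences longest first; each one joins the
    first representative it matches at or above threshold, otherwise it
    becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


def compare_two_sets(db1: Dict[str, str], db2: Dict[str, str],
                     threshold: float = 0.9) -> Dict[str, str]:
    """Compare two datasets: map each db2 sequence to the first db1 sequence
    it matches at or above threshold; unmatched sequences are omitted."""
    hits: Dict[str, str] = {}
    for name2, s2 in db2.items():
        for name1, s1 in db1.items():
            if identity(s1, s2) >= threshold:
                hits[name2] = name1
                break
    return hits
```

Calling greedy_cluster on a dictionary of named sequences returns a mapping from each representative to its cluster members, while compare_two_sets returns only the cross-dataset matches. The pairwise scan shown here is quadratic and does not reflect the speed the abstract reports.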