Clustered sequence representation for fast homology search

Michael Cameron¹, Yaniv Bernstein, Hugh E Williams

PMID: 17683263
DOI: 10.1089/cmb.2007.R005

Review

Michael Cameron et al. J Comput Biol. 2007 Jun;14(5):594-614.

Affiliation

¹ School of Computer Science and Information Technology, RMIT University, Melbourne, Australia. mcam@cs.rmit.edu.au

Abstract

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
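The two-stage idea in the abstract (score the union-sequence first, reconstruct members only when it scores well) can be sketched roughly as follows. This is a toy illustration, not the FSA-BLAST implementation. Several pieces are assumptions introduced here: members of a cluster are equal-length and differ only by substitutions, differing columns are marked with a hypothetical `*` wildcard, and `toy_score` (best ungapped match count) stands in for a real BLAST alignment score. The names `Cluster`, `build_cluster`, `reconstruct`, `toy_score`, and `search` are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass

WILDCARD = "*"  # assumption: marks columns where cluster members disagree


@dataclass
class Cluster:
    union: str                               # representative union-sequence
    edits: dict[str, list[tuple[int, str]]]  # member id -> [(position, residue)]


def build_cluster(members: dict[str, str]) -> Cluster:
    """Collapse equal-length, near-identical sequences into one cluster."""
    seqs = list(members.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("this sketch only handles equal-length members")
    # Keep a residue where all members agree, otherwise emit a wildcard.
    union = "".join(
        col[0] if len(set(col)) == 1 else WILDCARD for col in zip(*seqs)
    )
    # Each member stores only its own residues at the wildcard columns.
    edits = {
        name: [(i, seq[i]) for i, u in enumerate(union) if u == WILDCARD]
        for name, seq in members.items()
    }
    return Cluster(union, edits)


def reconstruct(cluster: Cluster, name: str) -> str:
    """Rebuild one member by applying its edits to the union-sequence."""
    chars = list(cluster.union)
    for pos, residue in cluster.edits[name]:
        chars[pos] = residue
    return "".join(chars)


def toy_score(query: str, subject: str, wildcard_matches: bool = False) -> int:
    """Best ungapped match count over all offsets (stand-in for a real score)."""
    best = 0
    for offset in range(-(len(query) - 1), len(subject)):
        hits = 0
        for i, q in enumerate(query):
            j = offset + i
            if 0 <= j < len(subject):
                s = subject[j]
                if s == q or (wildcard_matches and s == WILDCARD):
                    hits += 1
        best = max(best, hits)
    return best


def search(query: str, clusters: list[Cluster], threshold: int) -> list[tuple[str, int]]:
    """Two-stage search: filter on union-sequences, then score members."""
    hits = []
    for cluster in clusters:
        # Stage 1: wildcards count as matches, so in this toy model the
        # union score is an upper bound on every member's score.
        if toy_score(query, cluster.union, wildcard_matches=True) < threshold:
            continue
        # Stage 2: reconstruct and score members of promising clusters only.
        for name in cluster.edits:
            score = toy_score(query, reconstruct(cluster, name))
            if score >= threshold:
                hits.append((name, score))
    return sorted(hits, key=lambda h: -h[1])


cluster = build_cluster({"a": "MKTAYIAKQR", "b": "MKTAYLAKQR", "c": "MKTAYIAKQK"})
other = build_cluster({"d": "GGGSSGGGSS"})
results = search("TAYIAK", [cluster, other], threshold=6)
```

In this sketch the second cluster is rejected after a single comparison against its union-sequence, which is where the search-time saving would come from. The space saving comes from storing one union-sequence plus short edit lists instead of every member in full.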
