Clustered sequence representation for fast homology search
- PMID: 17683263
- DOI: 10.1089/cmb.2007.R005
Abstract
We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store each cluster of near-identical sequences as a representative union-sequence plus a set of edits that recover each member from that sequence. During search, the query is compared only to the union-sequence representing each cluster; cluster members are reconstructed and aligned only if the union-sequence achieves a sufficiently high score. Using this approach with BLAST yields a 27% reduction in collection size and a corresponding 22% decrease in search time, with no significant change in accuracy. We also describe our clustering method, which uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. On the GenBank nonredundant protein database, our clustering approach is ten times faster than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new open-source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
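The two-stage idea in the abstract can be illustrated with a small sketch. This is not the FSA-BLAST implementation: it assumes equal-length cluster members that differ only by substitutions, uses the first member as a stand-in for the union-sequence, and replaces alignment scoring with a toy shared k-mer count. All names (`Cluster`, `build_cluster`, `reconstruct`, `toy_score`, `search`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A representative sequence plus per-member substitution edits."""
    representative: str
    # member name -> list of (position, residue) substitutions
    edits: dict = field(default_factory=dict)


def build_cluster(members):
    """Encode equal-length members as edits against the first member.

    Assumption: substitutions only; a real scheme would also need
    insertions and deletions.
    """
    names = list(members)
    rep = members[names[0]]
    cluster = Cluster(representative=rep)
    for name in names:
        seq = members[name]
        if len(seq) != len(rep):
            raise ValueError("this sketch handles substitutions only")
        cluster.edits[name] = [
            (i, c) for i, (r, c) in enumerate(zip(rep, seq)) if r != c
        ]
    return cluster


def reconstruct(cluster, name):
    """Rebuild one member by applying its edits to the representative."""
    residues = list(cluster.representative)
    for pos, residue in cluster.edits[name]:
        residues[pos] = residue
    return "".join(residues)


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(query, subject, k=3):
    """Count distinct shared k-mers (a stand-in for an alignment score)."""
    return len(kmers(query, k) & kmers(subject, k))


def search(query, clusters, threshold):
    """Score representatives first; expand only clusters that pass."""
    hits = []
    for cluster in clusters:
        if toy_score(query, cluster.representative) < threshold:
            continue  # members of this cluster are never reconstructed
        for name in cluster.edits:
            hits.append((name, toy_score(query, reconstruct(cluster, name))))
    return sorted(hits, key=lambda hit: -hit[1])


members = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "MKSAYIAKQR"}
near_identical = build_cluster(members)
unrelated = build_cluster({"x": "GGGGGGGGGG"})
results = search("TAYIAK", [near_identical, unrelated], threshold=2)
```

In this sketch the unrelated cluster is rejected after a single comparison, which is where the search-time saving would come from. The storage saving comes from keeping one full sequence per cluster plus short edit lists.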