Improved BLAST searches using longer words for protein seeding
- PMID: 17921491
- DOI: 10.1093/bioinformatics/btm479
Abstract
Motivation: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7.
Results: We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operating characteristic) scores similar to those obtained with current default parameters.
Availability: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords
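The seeding heuristic described in the abstract can be illustrated with a minimal sketch. It is not the NCBI implementation: the function names, the toy scoring scheme (+5 for an identity, -1 otherwise, standing in for a real substitution matrix such as BLOSUM62) and the example sequences are all assumptions made for illustration. It also scans every word pair by brute force, whereas BLAST itself uses a precomputed lookup table of high-scoring words. The sketch only shows how word length and score threshold together decide which database positions are kept as seeds.

```python
from typing import Callable, List, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical substitution score (NOT BLOSUM62): +5 identity, -1 otherwise."""
    return 5 if a == b else -1


def word_score(w1: str, w2: str, score: Callable[[str, str], int] = toy_score) -> int:
    """Sum of per-residue scores for two aligned words of equal length."""
    return sum(score(a, b) for a, b in zip(w1, w2))


def seed_hits(
    query: str,
    subject: str,
    word_len: int,
    threshold: int,
    score: Callable[[str, str], int] = toy_score,
) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs whose words score >= threshold.

    Brute-force illustration only: every query word is compared with every
    subject word. A subject with no hits would be skipped by the heuristic.
    """
    hits = []
    for i in range(len(query) - word_len + 1):
        qword = query[i:i + word_len]
        for j in range(len(subject) - word_len + 1):
            if word_score(qword, subject[j:j + word_len], score) >= threshold:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    q = "MKTAYIAKQR"   # made-up query
    s = "GGKTAYLAKQG"  # made-up database sequence with one substitution (I -> L)
    print(seed_hits(q, s, word_len=3, threshold=15))  # exact 3-letter words only
    print(seed_hits(q, s, word_len=6, threshold=30))  # exact 6-letter words only
    print(seed_hits(q, s, word_len=6, threshold=24))  # 6-letter words, one mismatch allowed
```

In this toy setting, requiring exact 6-letter words finds no seed, while lowering the threshold to tolerate one mismatch recovers seeds spanning the substitution. This mirrors, in a simplified way, the abstract's point that the score threshold controls the trade-off when longer words are used.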
Similar articles
- Database indexing for production MegaBLAST searches. Bioinformatics. 2008 Aug 15;24(16):1757-64. doi: 10.1093/bioinformatics/btn322. Epub 2008 Jun 21. PMID: 18567917. Free PMC article.
- muBLASTP: database-indexed protein sequence search on multicore CPUs. BMC Bioinformatics. 2016 Nov 4;17(1):443. doi: 10.1186/s12859-016-1302-4. PMID: 27809763. Free PMC article.
- WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41. doi: 10.1093/bioinformatics/bti774. Epub 2005 Nov 15. PMID: 16287941.
- Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat Methods. 2004 Dec;1(3):195-202. doi: 10.1038/nmeth725. PMID: 15789030. Review.
- Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005. doi: 10.1093/nar/29.14.2994. PMID: 11452024. Free PMC article. Review.
Cited by
- SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier. Gigascience. 2019 Oct 1;8(10):giz118. doi: 10.1093/gigascience/giz118. PMID: 31648300. Free PMC article.
- SssP1, a Streptococcus suis Fimbria-Like Protein Transported by the SecY2/A2 System, Contributes to Bacterial Virulence. Appl Environ Microbiol. 2018 Aug 31;84(18):e01385-18. doi: 10.1128/AEM.01385-18. Print 2018 Sep 15. PMID: 30030221. Free PMC article.
- Global invasion history of the agricultural pest butterfly Pieris rapae revealed with genomics and citizen science. Proc Natl Acad Sci U S A. 2019 Oct 1;116(40):20015-20024. doi: 10.1073/pnas.1907492116. Epub 2019 Sep 10. PMID: 31506352. Free PMC article.
- Initiation of Chromosomal Replication in Predatory Bacterium Bdellovibrio bacteriovorus. Front Microbiol. 2016 Nov 28;7:1898. doi: 10.3389/fmicb.2016.01898. eCollection 2016. PMID: 27965633. Free PMC article.
- Dissecting the genetic basis of response to salmonid alphavirus in Atlantic salmon. BMC Genomics. 2025 Jul 11;26(1):657. doi: 10.1186/s12864-025-11735-2. PMID: 40646468. Free PMC article.