PLoS One. 2021 Apr 22;16(4):e0249410.
doi: 10.1371/journal.pone.0249410. eCollection 2021.

iBLAST: Incremental BLAST of new sequences via automated e-value correction

Sajal Dash et al. PLoS One. 2021.

Abstract

Local alignment search tools report the quality of a result with statistical scores that depend on the size of the database searched. For example, NCBI BLAST reports the best matches using similarity scores and expect values (i.e., e-values) calculated against the database size. Sequence databases grow continuously as new sequences are added, often substantially within the span of a single genomic investigation. As a consequence, the results (e.g., best hits) and associated statistics (e.g., e-values) for a specific set of queries may change over the course of that investigation. To update a previously conducted BLAST search so that it reflects the best matches in an updated database, scientists must currently rerun the search against the entire updated database, which discards the work of the earlier search and wastes execution time, money, and computational resources. To address this issue, we introduce iBLAST, a novel and efficient method that reuses past BLAST searches. iBLAST leverages previous BLAST search results to run the same queries only against the incremental (i.e., newly added) part of the database, recomputes the associated critical statistics such as e-values, and combines the old and new results to produce updated search results. Our experimental results and fidelity analyses show that iBLAST delivers search results identical to those of NCBI BLAST at a substantially reduced computational cost: iBLAST runs (1 + δ)/δ times faster than NCBI BLAST, where δ is the fraction by which the database has grown. We then present three use cases to demonstrate that iBLAST enables efficient biological discovery at a much faster speed and a substantially reduced computational cost.
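The merge step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that, for a fixed alignment score, an e-value scales linearly with database size, and it ignores the length adjustment that BLAST applies when computing its effective search space. The `Hit` record and the functions `rescale` and `merge_incremental` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hit:
    """One alignment hit (hypothetical record, not a BLAST data structure)."""
    subject_id: str
    bit_score: float
    evalue: float


def rescale(hits, searched_size, combined_size):
    """Rescale e-values from the size actually searched to the combined size.

    Assumption: e-value is proportional to database size for a fixed score.
    """
    if searched_size <= 0 or combined_size <= 0:
        raise ValueError("database sizes must be positive")
    factor = combined_size / searched_size
    return [replace(h, evalue=h.evalue * factor) for h in hits]


def merge_incremental(old_hits, old_size, delta_hits, delta_size, max_hits=500):
    """Combine hits from the old database and the newly added part.

    Both hit lists are rescaled to the combined size, then ranked by
    ascending e-value (ties broken by higher bit score).
    """
    combined_size = old_size + delta_size
    merged = (rescale(old_hits, old_size, combined_size)
              + rescale(delta_hits, delta_size, combined_size))
    merged.sort(key=lambda h: (h.evalue, -h.bit_score))
    return merged[:max_hits]
```

Under the same assumption, the function also covers the taxon-specific case: results from separate databases can be merged pairwise, each time rescaling to the running combined size.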

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Addition of new sequences.
(A) BLAST search when new sequences are added to the database. At time t, the database is D_t. In the next δt interval, new sequences (the difference between D_{t+δt} and D_t) are added, and the database becomes D_{t+δt}. With the traditional approach, the prior search result at time t cannot be reused, and we have to perform an entire BLAST search against the entire D_{t+δt} database. (B) BLAST search when several taxon-specific databases are present and a result against the combined database is needed. For three taxa, A, B, and C, we can perform individual BLAST searches against the databases D_A, D_B, and D_C, respectively. If we want to obtain a search result against the combined database D_ABC, we need to merge the search results in a way that their e-values reflect the combined database size.
Fig 2. Software stack of iBLAST.
The user initiates a search through the user interface. The search parameters are then passed to the “Incremental logic” module. After performing an incremental search, this module’s back-end corrects the e-value statistics and merges the results. The “Incremental logic” module consults an external lightweight database module called the “Record database” to decide whether and how to perform the incremental search. For the actual search and delta database creation, we use NCBI BLAST tools such as blastdbcmd, blastdbalias, blastp, and blastn.
Fig 3. Experimental design of three case studies.
(A) Case study I: Incremental addition of sequences in the nt database over three time periods. (B) Case study II: Incremental addition of sequences in the nr database over two time periods. (C) Case study III: Incremental search of taxon-specific databases.
Fig 4. Performance comparison between NCBI BLAST and iBLAST for case study I.
(A) Performance comparison between regular blastn and incremental blastn over 3 periods as the nt database grows, using 100 nucleotide queries. For 40.8% and 34.0% increases in the database size, iBLAST runs 2.93 and 3.03 times faster, respectively. (B) Performance comparison between regular blastp and incremental blastp over 3 periods as the nr database grows, using 100 protein queries. For 34.1% and 26.3% increases in the database size, iBLAST runs 4.33 and 4.98 times faster, respectively.
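As a rough check on these numbers, the (1 + δ)/δ ratio from the abstract can be evaluated for the growth fractions in this caption. The sketch below assumes search time is proportional to database size, so a full rerun costs 1 + δ units and an incremental search costs δ units. The function name `predicted_speedup` is hypothetical, and the measured values are copied from the caption above. Measured speedups need not match this idealized ratio.

```python
def predicted_speedup(delta):
    """Idealized speedup of an incremental search over a full rerun.

    delta is the database growth as a fraction of the old database size.
    Assumption: search time is proportional to database size.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return (1 + delta) / delta


# (program, growth fraction, measured speedup) as reported in the Fig 4 caption
reported = [
    ("blastn", 0.408, 2.93),
    ("blastn", 0.340, 3.03),
    ("blastp", 0.341, 4.33),
    ("blastp", 0.263, 4.98),
]

for program, delta, measured in reported:
    print(f"{program}: delta={delta:.3f} "
          f"predicted={predicted_speedup(delta):.2f} measured={measured:.2f}")
```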
