iSeqSearch: incremental protein search for iBlast/iMMSeqs2/iDiamond

Hyunwoo Yoo¹, Mohammadsaleh Refahi¹, Robi Polikar², Bahrad A Sokhansanj¹, James R Brown¹, Gail L Rosen¹

Affiliations

¹ Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States of America.
² Electrical and Computer Engineering, Rowan University, Glassboro, NJ, United States of America.

PMID: 40313391
PMCID: PMC12045279
DOI: 10.7717/peerj.19171

Hyunwoo Yoo et al. PeerJ. 2025 Apr 28;13:e19171. doi: 10.7717/peerj.19171. eCollection 2025.

Abstract

Background: The advancement of sequencing technology has led to a rapid increase in the amount of DNA and protein sequence data; consequently, the size of genomic and proteomic databases is constantly growing. As a result, database searches need to be continually updated to account for the new data being added. However, continually re-searching the entire existing dataset wastes resources. Incremental database search can address this problem.

Methods: One recently introduced incremental search method is iBlast, which wraps the BLAST sequence search method with an algorithm that reuses previously processed data and thereby increases search efficiency. The iBlast wrapper, however, must be generalized to support the better-performing DNA/protein sequence search methods that have since been developed, namely MMseqs2 and Diamond. To address this need, we propose iSeqSearch, which extends iBlast by incorporating support for MMseqs2 (iMMseqs2) and Diamond (iDiamond), thereby providing a more general and broadly effective incremental search framework. Moreover, the previously published iBlast wrapper needs to be revised to be more robust and more usable by the general community.
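
The merge step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that e-values scale linearly with database size (as in Karlin-Altschul statistics), so a hit found against one batch is rescaled to the combined database before results are merged. The names `Hit`, `rescale`, and `merge_incremental` are hypothetical, and the actual iBlast correction may differ in detail.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    evalue: float


def rescale(hits: List[Hit], searched_size: int, total_size: int) -> List[Hit]:
    """Rescale e-values from the searched batch to the combined database.

    Assumption: e-values grow in proportion to database size.
    """
    factor = total_size / searched_size
    return [Hit(h.query, h.target, h.evalue * factor) for h in hits]


def merge_incremental(old_hits: List[Hit], old_size: int,
                      new_hits: List[Hit], new_size: int,
                      max_evalue: float = 10.0) -> List[Hit]:
    """Merge stored hits with hits from a newly added batch.

    Only the new batch is searched; the old hits are reused after rescaling.
    """
    total = old_size + new_size
    merged = rescale(old_hits, old_size, total) + rescale(new_hits, new_size, total)
    kept = [h for h in merged if h.evalue <= max_evalue]
    # Order hits per query from most to least significant.
    return sorted(kept, key=lambda h: (h.query, h.evalue))
```

For example, merging a stored hit with e-value 1e-6 and a new hit with e-value 1e-8, each found against a batch of equal size, doubles both e-values and ranks the new hit first.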

Results: iMMseqs2 and iDiamond, which apply the incremental approach, perform nearly identically to MMseqs2 and Diamond. Notably, ranking comparison measures such as the Pearson correlation show a high concordance of over 0.9 between the incremental and non-incremental results. Moreover, in some cases, our incremental approach, iSeqSearch, which extends the iBlast merge function to iMMseqs2 and iDiamond, returns more hits than the conventional MMseqs2 and Diamond methods.
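
As a rough illustration of how such concordance could be measured, the sketch below computes the Pearson correlation and a Kendall tau coefficient between two score lists for the same hits. It is a sketch under assumptions: the abstract does not state which quantity is correlated or which tau variant is used, so paired per-hit scores and the simple tau-a variant (no tie correction) are assumed here, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` would hold the scores of a set of hits from the non-incremental search and `y` the scores of the same hits from the incremental search; values near 1 indicate that both searches order the hits in nearly the same way.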

Conclusion: The incremental approach using iMMseqs2 and iDiamond demonstrates efficiency in terms of reusing previously processed data while maintaining high accuracy and concordance in search results. This method can reduce resource waste in continually growing genomic and proteomic database searches. The sample code and data are available on GitHub and Zenodo (https://github.com/EESI/Incremental-Protein-Search; DOI: 10.5281/zenodo.14675319).

Keywords: Bioinformatics tools; Incremental Search; Incremental protein search; Protein database search; Protein function prediction; Protein search algorithms; Protein sequence analysis; Sequence Alignment.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. The left graph (A) illustrates increasing e-values and hit counts when comparing incremental methods (iSeqSearch, including iBlastp, iMMseqs2, and iDiamond) to non-incremental methods (Blastp, MMseqs2, Diamond). The x-axis represents the ratio of the number of hits identified by incremental methods to the number identified by non-incremental methods. The y-axis represents the ratio of the average e-values obtained by incremental methods to those obtained by non-incremental methods. In this graph, only e-values below 1e-5 are considered. The right graph (B) compares the processing times of these methods, showing that incremental methods are faster. Additional Batch for Merge refers to a scenario in incremental search where the database size increases progressively as multiple batches are added. The value of Additional Batch for Merge is the number of batches contributing to the database size.
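
The two ratios plotted in panel A can be written out as a short calculation. This is only a sketch of the definitions given in the caption; the function name `figure1_ratios` and its inputs are hypothetical, and the authors' own scripts may compute these values differently.

```python
from typing import Sequence, Tuple


def figure1_ratios(incremental_evalues: Sequence[float],
                   baseline_evalues: Sequence[float],
                   threshold: float = 1e-5) -> Tuple[float, float]:
    """Return (hit-count ratio, mean e-value ratio), incremental over baseline.

    Only e-values below the threshold are considered, as in the caption.
    """
    inc = [e for e in incremental_evalues if e < threshold]
    base = [e for e in baseline_evalues if e < threshold]
    if not inc or not base:
        raise ValueError("each method needs at least one e-value below the threshold")
    base_mean = sum(base) / len(base)
    if base_mean == 0:
        # Search tools can report e-values as exactly 0.0 for very strong hits.
        raise ValueError("baseline mean e-value is zero; ratio is undefined")
    hit_ratio = len(inc) / len(base)
    mean_ratio = (sum(inc) / len(inc)) / base_mean
    return hit_ratio, mean_ratio
```

A hit-count ratio above 1 means the incremental method reports more hits below the threshold than the non-incremental method, and a mean e-value ratio above 1 means its e-values are larger on average.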

**Figure 2. Heatmaps representing the Kendall tau correlation coefficient and the Pearson correlation coefficient for each search result.**
In the searches, one-tenth of the SCOPe ASTRAL protein dataset is used as the query, and the remaining dataset is used as the database. The database is randomly sampled and divided into nine equal batches, which are then incrementally combined based on the search results of the fully combined dataset. In the Pearson correlation heatmap, iBlastp and Blastp achieve a score of 0.97, which is the highest among the methods, while iMMseqs2 and MMseqs2, as well as iDiamond and Diamond, show a score of 1.0, indicating the highest similarity. For the Kendall tau correlation, iBlastp and Blastp score 0.93, which is higher than the other methods, and iDiamond and Diamond also have a relatively high score of 0.85. These observations indicate that iBlastp and Blastp, iMMseqs2 and MMseqs2, and iDiamond and Diamond all provide similar results.
