Nat Methods. 2025 Oct;22(10):2024-2027.
doi: 10.1038/s41592-025-02819-8. Epub 2025 Sep 18.

GPU-accelerated homology search with MMseqs2

Felix Kallenborn et al. Nat Methods. 2025 Oct.

Abstract

Rapidly growing protein databases demand faster sensitive search tools. Here the graphics processing unit (GPU)-accelerated MMseqs2 delivers 6× faster single-protein searches than CPU methods on 2 × 64 cores, speeds previously requiring large protein batches. For larger query batches, it is the most cost-effective solution, outperforming the fastest alternative method by 2.4-fold with eight GPUs. It accelerates protein structure prediction with ColabFold 31.8× over the standard AlphaFold2 pipeline and protein structure search with Foldseek by 4–27×. MMseqs2-GPU is available under an open-source license at https://mmseqs.com/.

Conflict of interest statement

Competing interests: C.D., A.C., C.H., H.S. and K.D. are employed by NVIDIA. M.S. declares an outside interest in Stylus Medicine. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. MMseqs2-GPU workflow and gapless alignment performance.
a, Gapless alignment scans reference sequences against a query, ranking and filtering them by alignment scores. b, Sequences above a threshold proceed to gapped Smith–Waterman–Gotoh alignment. c, GPU-optimized gapless alignment splits the query profile into segments (up to 2,048 residues), loading them into fast shared memory for efficient access by GPU threads; warp shuffles allow efficient cross-thread data sharing for diagonal computations. d, GPU speedups (1, 2, 4 and 8 L40S GPUs) relative to a 2 × 64-core CPU for random sequence pairs (lengths 32–2,048). e, GPU speedups (1 and 8 GPUs) versus a 2 × 64-core CPU for 6,370 queries searched against reference databases 1×, 4× and 16× the size of a 30-million-protein database. The 16× set exceeds GPU memory and requires database streaming, which runs at 7.575 TCUPS versus 11.676 TCUPS in memory (≈64.9% of in-memory performance).
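The prefilter described in panel a can be illustrated with a minimal CPU sketch in Python. It is not the MMseqs2-GPU kernel: it assumes a simple match/mismatch score in place of a query profile, and the names `gapless_score` and `prefilter` are hypothetical. It scores each diagonal of the query/target matrix without gaps, keeps the best-scoring run, then ranks targets and drops those below a threshold.

```python
def gapless_score(query, target, match=2, mismatch=-1):
    """Best ungapped local alignment score over all diagonals.

    Assumption: flat match/mismatch scoring stands in for the
    query profile used by the real tool.
    """
    best = 0
    m, n = len(query), len(target)
    # Diagonal d holds the cells (i, j) with j - i == d.
    for d in range(-(m - 1), n):
        run = 0
        i = max(0, -d)
        j = i + d
        while i < m and j < n:
            step = match if query[i] == target[j] else mismatch
            # Reset to zero when the running score goes negative (local alignment).
            run = max(0, run + step)
            best = max(best, run)
            i += 1
            j += 1
    return best


def prefilter(query, targets, threshold):
    """Score every target, keep those at or above threshold, best first."""
    scored = [(gapless_score(query, seq), name) for name, seq in targets.items()]
    hits = [(score, name) for score, name in scored if score >= threshold]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    db = {"t1": "XXACDEXX", "t2": "CCCC", "t3": "ACXDE"}
    for score, name in prefilter("ACDE", db, threshold=4):
        print(name, score)
```

In this sketch each diagonal is independent of the others, which is the property that makes the computation straightforward to parallelize; the surviving targets would then go on to the gapped alignment of panel b.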
Fig. 2
Fig. 2. MMseqs2-GPU runtimes for homology search.
a, In single-batch processing of 6,370 queries against a 30-million-sequence database, MMseqs2-GPU on one L40S GPU (dark green; baseline in bold, horizontal) is ~16× faster than BLAST (dark blue) and ~178× faster than JackHMMER (purple; measured on 10% of queries). MMseqs2-GPU achieves further speedups (up to ~5×) by splitting databases across multiple GPUs (bright versus dark green). MMseqs2-GPU on a single L40S provided the lowest AWS cost for all batch sizes; MMseqs2-CPU k-mer was faster at a batch size of 6,370, but 1.6× more costly (bottom). b,c, MMseqs2-GPU accelerates structure prediction without compromising accuracy (0.70 ± 0.05 TM-score). On 20 CASP14 targets, ColabFold MMseqs2-GPU (green) was 1.65× faster than ColabFold-CPU k-mer (orange) and 31.8× faster than AlphaFold2 (JackHMMER+HHblits, violet). MMseqs2 searched 238 million cluster representatives and expanded to 1 billion members; JackHMMER searched 426 million sequences, and HHblits searched 81 million profiles containing 2.1 billion members. d, Foldseek-GPU on one L40S (dark green, baseline in bold, horizontal) is 4× faster than Foldseek-CPU k-mer (orange) at large batch sizes (6,370 queries). Eight L40S GPUs accelerate searches by 7× compared to one GPU, and 27× compared to Foldseek-CPU.
Extended Data Fig. 1
Extended Data Fig. 1. Combined gapless and gapped alignment TCUPS.
TCUPS for 1- and 8-GPU runs of the combined MMseqs2-GPU gapless and gapped alignment workflow, for 6,370 queries against target sets 1×, 2×, 4×, 8× and 16× the size of a 30-million-protein database (Methods, ‘Sensitivity’). The 8× and 16× runs exceed GPU RAM and are processed with database streaming. The 16× run is processed at 7.3/11.6 TCUPS ≈ 63% of in-memory processing speed.
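The streaming figures quoted in the captions can be checked with a few lines of Python. TCUPS stands for tera cell updates per second, where one cell is one query-residue by target-residue entry of the alignment matrix. The function names below are hypothetical, and the residue counts in the usage example are made up purely to show the unit conversion.

```python
def tcups(query_residues, target_residues, seconds):
    """Tera cell updates per second for a full query-by-target scan."""
    return query_residues * target_residues / seconds / 1e12


def streaming_efficiency(streamed_tcups, in_memory_tcups):
    """Fraction of in-memory throughput retained when streaming the database."""
    return streamed_tcups / in_memory_tcups


if __name__ == "__main__":
    # Ratios reported in the Fig. 1e and Extended Data Fig. 1 captions.
    print(f"{streaming_efficiency(7.575, 11.676):.1%}")
    print(f"{streaming_efficiency(7.3, 11.6):.0%}")
    # Illustrative (made-up) sizes: 1,000 query residues against
    # 10^12 target residues in 1,000 s is 1 TCUPS.
    print(tcups(1_000, 10**12, 1_000.0))
```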

