Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Aug 12;20(1):15.
doi: 10.1186/s13015-025-00281-x.

b-move: faster lossless approximate pattern matching in a run-length compressed index

Affiliations

b-move: faster lossless approximate pattern matching in a run-length compressed index

Lore Depuydt et al. Algorithms Mol Biol. .

Abstract

Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.

Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs ϕ and ϕ - 1 operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.

Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

Keywords: Approximate pattern matching; Bidirectional search; Cache efficiency; FM-index; Lossless alignment; Move structure; Pan-genomics; r-index.

PubMed Disclaimer

Conflict of interest statement

Declarations. Ethics Approval and Consent to Participate: Not applicable. Consent for Publication: Not applicable. Competing Interests: The authors declare that they have no Conflict of interest.

Figures

Algorithm 1
Algorithm 1
Find LF(i)=MLF.moveStep(i,j) given tuple (i, j), where j is the run index that contains BWT index i.
Algorithm 2
Algorithm 2
Let [se] be the interval over SA corresponding to P, which we want to extend to cP. Rs and Re are the runs containing s and e, respectively. Functions walkToNextRun and walkToPreviousRun return the SA indices sc and ec (along with their run indices Rs,c and Re,c) which indicate the smallest subinterval within [se] containing all occurrences of c in the BWT.
Algorithm 3
Algorithm 3
Extend pattern P, represented by [se], corresponding to respectively run index Rs and Re, to cP. The algorithm returns the updated interval [s,e], corresponding to respectively run index Rs and Re, for cP.
Algorithm 4
Algorithm 4
Update intervals ([s,e],Rs,Re) and ([srev,erev],Rsrev,Rerev) corresponding to P, to intervals ([s,e],Rs,Re) and ([srev,erev],Rsrev,Rerev) that correspond to cP.
Algorithm 5
Algorithm 5
Update run indices using binary search to find the correct run containing index i. The algorithm uses the move table MLF upon which it is called.
Fig. 1
Fig. 1
Log scale histograms for the number of walking and fast forwarding steps required per successful bidirectional character extension (96,363,328 in total) for finding all SA intervals corresponding to all occurrences up to 3 mismatches. We aligned 100000 Illumina reads of length 151 bp and their reverse complements to the pan-genome composed of 512 human chromosome 19 haplotypes
Fig. 2
Fig. 2
Benchmark results for finding all SA intervals corresponding to all occurrences up to a certain number of mismatches (x-axis). The left panel displays the total search time, while the right panel shows the average execution time for the core bidirectional character extension functionality. We aligned 100000 Illumina reads of length 151 bp and their reverse complements to the pan-genome composed of 512 human chromosome 19 haplotypes
Fig. 3
Fig. 3
Log scale histogram for the number of fast forwarding steps required per ϕ or ϕ-1 operation (355,041,072 in total) for finding all occurrences up to 3 mismatches. We aligned 100000 Illumina reads of length 151 bp and their reverse complements to the pan-genome composed of 3264 E. coli strains, using unbalanced ϕ move tables. Using balanced ϕ move tables, the mean number of fast forwarding steps is reduced to 1.49 (not shown)
Fig. 4
Fig. 4
Benchmark results for finding all occurrences with up to a certain number of mismatches (x-axis). The left panel presents the average locate time per SA interval, while the right panel shows the average execution time for the core ϕ and ϕ-1 operations. The experiment aligns 100,000 Illumina reads of length 151 bp and their reverse complements to the pan-genome composed of 3264 E. coli strains
Fig. 5
Fig. 5
Benchmark results for approximate pattern matching using the br-index, b-move (with and without ϕ move tables), and Columba (suffix array sparseness of 8). Additionally, we include results for b-move (with ϕ move tables) with reporting functionality, which is not included in the other results. We aligned 100000 Illumina reads of length 151 bp and their reverse complements to pan-genomes for both human chromosome 19 and E. coli, across multiple sizes. We allowed for a maximum error distance of 3 (Hamming for br-index, edit for b-move and Columba)
Fig. 6
Fig. 6
Log-scale time-memory comparison for approximate pattern matching using the br-index, b-move (with and without ϕ move tables), and Columba (suffix array sparseness of 8). We show the switch point for each pan-genome, which marks the minimum pan-genome size at which the b-move index becomes smaller than Columba. We aligned 100000 Illumina reads of length 151 bp and their reverse complements to pan-genomes for both human chromosome 19 and E. coli, across multiple sizes. We allowed for a maximum error distance of 3 (Hamming for br-index, edit for b-move and Columba)

Update of

References

    1. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings Bioinform. 2016;19(1):118–35. 10.1093/bib/bbw089. - PMC - PubMed
    1. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. 10.1093/bioinformatics/btp324. - PMC - PubMed
    1. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. 10.1038/nmeth.1923. - PMC - PubMed
    1. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings 41st annual symposium on Foundations of Computer Science; 2000. p. 390-8. 10.1109/SFCS.2000.892127.
    1. Burrows M, Wheeler D. A Block-sorting lossless data compression algorithm. 130 Lytton Avenue, Palo Alto, California 94301: Digital Equipment Corporation Systems Research Center; 1994. p. 124.

LinkOut - more resources