Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2025 May 27:2025.05.22.655637.
doi: 10.1101/2025.05.22.655637.

Movi Color: fast and accurate long-read classification with the move structure

Affiliations

Movi Color: fast and accurate long-read classification with the move structure

Steven Tan et al. bioRxiv. .

Abstract

The number of reference genomes is rapidly increasing, thanks to advances in long-read sequencing and assembly. While these collections can improve the sensitivity and specificity of classification methods, this requires highly efficient compressed indexes. K-mer-based approaches like Kraken 2 are efficient but limit the analysis to a fixed k-mer length. This is hard for the user to set ahead of time, and suboptimal settings can harm sensitivity and specificity. Methods that use compressed full-text indexes like SPUMONI2 and Cliffy lift this constraint, but are less efficient than k-mer-based tools. Further, these methods either cannot report a full listing of genomes where a match occurs, or cannot scale to large reference databases. We propose new methods and algorithms that use compressed full-text indexes to enable multi-class and taxonomic classification. Unlike past compressed-indexing methods for classification, ours uses the move structure, which is extremely fast thanks to its locality of reference. Our method, called Movi Color, augments the main table of the Movi index. Specifically, Movi Color assigns a "color" to each run of the Burrows-Wheeler Transform according to the subset of genomes from which the run suffixes originated. When the reference is highly repetitive - as is typical when indexing pangenomes or reference databases - only certain colors occur, creating opportunities to compress the index. For species-level classification, Movi Color achieves over 1.6× higher precision and 2× higher recall than Kraken 2 and Metabuli. At the genus level, it achieves 70% higher precision and 80% higher recall. Movi Color's read processing time is 7-20× faster than Metabuli and is a comparable to Kraken 2. Although Movi Color uses more memory than both Kraken 2 and Metabuli, its speed-accuracy trade-off makes it well-suited for real-time or high-throughput scenarios.

Keywords: Applied computing; Comparative genomics; Compressed indexing; Computational genomics; Pangenomics.

PubMed Disclaimer

Figures

Figure 1
Figure 1
(a) The documents and the concatenated text used for building the BWM. The red dotted lines are the document boundaries in the concatenated text. (b) The BWT entries with the corresponding suffixes and suffix array (SA) entries. The document ID at which the suffix starts is assigned based on the SA entry and the document boundaries in the concatenated text. The color of a run is the set of all the documents from the BWT offsets in the run. Each separate color is assigned a unique ID. The color ID column (indicated by the red box) is stored in the Movi index as a new column in the main table. (c) The color table which maps a color ID to a set of documents representing the color is stored separately in the index. (d) The Movi Color table which is the same as the Movi table with an additional column to store the color ID for each run.
Figure 2
Figure 2
An example to compare backward-search and PML computation algorithms side by side. The read R=AGCC is being searched to the index starting at the right most end. The backward search maintains a BWT interval (BWT[i..j]) which contains all the possible exact matches. The offsets in the BWT interval are showed by green at each steps. The shaded green characters are the ones that are matched by the backward search until that step. The pink character is the character queried at each step. The purple box shows the offset tracked by the PML query at each step. When the character of the BWT offset matches the read’s character, we have case 1, otherwise it is called a case 2. The PML length is incremented by one for each case 1 and is reset to 0 by case 2. The repositioning is performed after each case 2 to find a BWT offset that matches the read’s character. The bottom row of the table shows the colors identified by the tracked BWT offset during PML query. See Figure 1 for the definition of colors.
Figure 3
Figure 3
Classification accuracies of Movi Color compared to Metabuli and Kraken 2

Similar articles

References

    1. Agustinho Daniel P, Fu Yilei, Menon Vipin K, Metcalf Ginger A, Treangen Todd J, and Sedlazeck Fritz J. Unveiling microbial diversity: harnessing long-read sequencing technology. Nature methods, 21(6):954–966, 2024. - PMC - PubMed
    1. Ahmed O., Rossi M., Kovaka S., Schatz M. C., Gagie T., Boucher C., and Langmead B.. Pan-genomic matching statistics for targeted nanopore sequencing. iScience, 24(6):102696, Jun 2021. - PMC - PubMed
    1. Ahmed Omar, Boucher Christina, and Langmead Ben. Cliffy: robust 16s rrna classification based on a compressed lca index. bioRxiv, pages 2024–05, 2024.
    1. Ahmed Omar Y, Rossi Massimiliano, Gagie Travis, Boucher Christina, and Langmead Ben. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023. - PMC - PubMed
    1. Almodaresi Fatemeh, Pandey Prashant, Ferdman Michael, Johnson Rob, and Patro Rob. An efficient, scalable, and exact representation of high-dimensional color information enabled using de bruijn graph search. Journal of Computational Biology, 27(4):485–499, 2020. - PMC - PubMed

Publication types

LinkOut - more resources