[Preprint]. 2025 Jan 8:2025.01.02.631161.
doi: 10.1101/2025.01.02.631161.

Fast and flexible minimizer digestion with digest

Alan Zheng et al. bioRxiv.

Abstract

Minimizer digestion is an increasingly common component of bioinformatics tools, including tools for de Bruijn graph assembly and sequence classification. We describe Digest, a new open-source tool and library for efficient digestion of genomic sequences. It can produce digests based on the related ideas of minimizers, modimizers, or syncmers. Digest uses efficient data structures, scales well to many threads, and produces digests with expected spacings between digested elements. Digest is implemented in C++17 with a Python API and is available as open source at https://github.com/VeryAmazed/digest.
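To make the three digestion ideas concrete, the following is a minimal Python sketch using their commonly stated textbook definitions. It does not use Digest's own API: the function names, the CRC32 hash, the leftmost tie-breaking rule, and the choice of "closed" syncmers are all assumptions made for illustration, and Digest's definitions and parameters may differ.

```python
import zlib

def kmer_hashes(seq, k):
    """Hash every k-mer of seq with CRC32 (a stand-in hash, assumed for illustration)."""
    data = seq.encode("ascii")
    return [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]

def modimizers(seq, k, mod):
    """Positions of k-mers whose hash is congruent to 0 modulo `mod`."""
    return [i for i, h in enumerate(kmer_hashes(seq, k)) if h % mod == 0]

def window_minimizers(seq, k, w):
    """Positions of the minimum-hash k-mer in each window of w consecutive k-mers."""
    hashes = kmer_hashes(seq, k)
    picked = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Ties go to the leftmost k-mer (an arbitrary choice in this sketch).
        pos = start + window.index(min(window))
        # Adjacent windows often share a minimizer; record each position once.
        if not picked or picked[-1] != pos:
            picked.append(pos)
    return picked

def closed_syncmers(seq, k, s):
    """Positions of k-mers whose minimum-hash s-mer is their first or last s-mer."""
    s_hashes = kmer_hashes(seq, s)
    last = k - s  # offset of the final s-mer inside a k-mer
    picked = []
    for i in range(len(seq) - k + 1):
        inner = s_hashes[i:i + last + 1]
        smallest = min(inner)
        if inner[0] == smallest or inner[-1] == smallest:
            picked.append(i)
    return picked
```

Calling `window_minimizers("ACGTTGCAAGCTTAGGCTAACGTTAGCCATG", 5, 4)` returns a sorted list of k-mer start positions in which consecutive positions are at most `w` apart, which is the window guarantee that separates minimizers from modimizers. Recomputing `min` per window costs O(w) per step; the sketch trades speed for readability.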

Keywords: digestion; minimizers; sequence analysis; syncmers.

Conflict of interest statement

Competing interests: All authors contributed to and reviewed the manuscript. No competing interest is declared.

Figures

Fig. 1:
(A) Comparison of min query speed for different data structures as a function of window size. In this benchmark, each data structure performs 10 million queries on an array of uniformly distributed 32-bit hash values. (B) Throughput of the different digestion schemes in Digest (using the segment tree data structure) when computing the digest of a 62M human chromosome Y sequence consisting of only A/C/G/T characters. Benchmarking for both (A) and (B) was performed on a 48-core 3 GHz Intel Xeon Gold Cascade Lake 6248R CPU with 192GB RAM.
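As a rough illustration of the kind of min query benchmarked in panel (A), here is a small Python sketch of a bottom-up segment tree answering range-minimum queries over an array of hash values. The class and function names are hypothetical, and the layout is a standard textbook one; the caption does not say how Digest organizes its segment tree, so this should not be read as its implementation.

```python
import random

class MinSegmentTree:
    """Iterative (bottom-up) segment tree for range-minimum queries."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at indices n .. 2n-1; index 0 is unused.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Minimum of values[lo:hi] (half-open interval), in O(log n)."""
        result = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and step right
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:  # hi is a right boundary: step left and take it
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result

def sliding_minima(values, w):
    """Minimum of every window of w consecutive values, via the tree."""
    tree = MinSegmentTree(values)
    return [tree.query(i, i + w) for i in range(len(values) - w + 1)]

if __name__ == "__main__":
    random.seed(0)
    # Uniform 32-bit values, mirroring the benchmark's input description.
    hashes = [random.getrandbits(32) for _ in range(1000)]
    print(len(sliding_minima(hashes, 16)))
```

Each query costs O(log n) regardless of window size, whereas a naive scan costs O(w); that difference is the sort of trade-off a comparison across window sizes would expose.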
