Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Mar 17;20(1):4.
doi: 10.1186/s13015-025-00270-0.

The open-closed mod-minimizer algorithm

Affiliations

The open-closed mod-minimizer algorithm

Ragnar Groot Koerkamp et al. Algorithms Mol Biol. .

Abstract

Sampling algorithms that deterministically select a subset of k -mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k -mer out of every window of w consecutive k -mers. The folklore and most used scheme is the random minimizer that selects the smallest k -mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k -mers) of 2 / ( w + 1 ) . In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when k , like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k w . In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed sycnmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method-the open-closed minimizer-offers improved density for small k w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, that achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k > w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.

Keywords: Hashing; Minimizers; Randomized algorithms; Sketching.

PubMed Disclaimer

Conflict of interest statement

Declarations. Competing interests: The authors declare that they do no have any competing interests.

Figures

Fig. 1
Fig. 1
An illustration of the behavior of the mod-minimizer for w+k-1=34, and t=4. Rows indicate consecutive windows. The thick outlined boxes mark the minimal t-mer in each window and the regions highlighted in red indicate the sampled k-mer
Algorithm 1
Algorithm 1
Pseudocode for the random minimizer algorithm.
Algorithm 2
Algorithm 2
Pseudocode for the miniception (left) and the open-closed (“OC”, right) minimizer methods. The differences are highlighted in blue.
Fig. 2
Fig. 2
Example of finding all open (blue) and closed (red) syncmers in a context of size w+k, for s=3, k=7, and w=6. One of the closed syncmers is charged because it is the rightmost k-mer
Algorithm 3
Algorithm 3
Pseudocode to compute the density of the open-closed (mod-) minimizer in polynomial time. For the open-closed minimizer, simply set t=k. To compute the density of the closed minimizer (a.k.a., miniception) or open minimizer, ignore the if statements at line 6 or 8.
Fig. 3
Fig. 3
Density for w=24 and varying k, measured on a random string of ten million i.i.d. random characters for σ=4. For the methods OC, O, and C, we use s=4 for the solid lines. The dashed lines, instead, use the best choice of s
Algorithm 4
Algorithm 4
Pseudocode for the mod-sampling and the extended mod-minimizer methods. For the extended mod-minimizer, A:Σw+k-1[w+k-t] is any sampling scheme for parameters w+k-t and t used to define the anchor. We assume that A defines an order between t-mers and that its definition might use additional parameters to define the sampling, like st in case A is the open-closed minimizer from Sect. The open-closed minimizer. We slightly abuse notation and call A as A(W,w,k).
Fig. 4
Fig. 4
Example of the open-closed mod-minimizer for s=3, t=7, k=15, and w=8
Fig. 5
Fig. 5
Density for w=24 and by varying k, measured on a random string of ten million i.i.d. random characters for σ=4. We use r=4 for all methods. For the methods mod-OC, mod-O, and mod-C, we use s=4 for the solid lines. The dashed lines, instead, use the best choice of s
Fig. 6
Fig. 6
Density for w=24 and by varying k, measured on a random string of ten million i.i.d. random characters for σ=256. For all methods that require the parameter s, we use s=1 for the solid lines. The dashed lines, instead, use the best choice of s. We use r=1 for all methods

References

    1. Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol. 2024;25(1):270. 10.1186/s13059-024-03414-4. - PMC - PubMed
    1. Zheng H, Marçais G, Kingsford C. Creating and using minimizer sketches in computational genomics. J Comput Biol. 2023;30(12):1251–76. 10.1089/cmb.2023.0094. - PMC - PubMed
    1. Pibiri GE. Sparse and skew hashing of formula image-mers. Bioinformatics. 2022;38:185–194. 10.1093/bioinformatics/btac245. - PMC - PubMed
    1. Pibiri GE. On weighted formula image-mer dictionaries. Algorithms Mol Biol. 2023;18(1):1–20. 10.1186/s13015-023-00226-2. - PMC - PubMed
    1. Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for formula image-mers. Bioinformatics. 2021;37(18):2858–65. 10.1093/bioinformatics/btab217. - PubMed

LinkOut - more resources