Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Dec 26;41(1):btae736.
doi: 10.1093/bioinformatics/btae736.

A near-tight lower bound on the density of forward sampling schemes

Affiliations

A near-tight lower bound on the density of forward sampling schemes

Bryce Kille et al. Bioinformatics. .

Abstract

Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two.

Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k≡1(mod w). For large w and k, the bound can be approximated by 1w+k⌈w+kw⌉. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al. is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k≡1(mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound.

Availability and implementation: Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers ILP and analysis: github.com/treangenlab/sampling-scheme-analysis.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
(a) A De Bruijn graph B3 corresponding to a minimum density (w=2,k=2)-forward scheme. The underlined characters in each vertex represent the 2-mer that is selected for that window. The solid edges represent the charged contexts and the edge colors represent the pure cycles in B4 (not in B3 itself). For characters ci, each edge (c0c1c2,c0c1c2) in B3 corresponds to the vertex c0c1c2c2 in B4. (b) The corresponding (w=2,=4)-UHS in B4. The vertices are partitioned by color, representing the pure-cycles. The 2-mer(s) selected in each context are underlined. The vertices with a double border represent the charged edges in B3 in (a) and the corresponding (2, 4)-UHS. Each pure cycle c has at least |c|/w vertices in the UHS.
Figure 2.
Figure 2.
Comparison of forward scheme lower bounds and optimal densities for small w, k, and σ. Optimal densities were obtained via the ILP and are plotted as black circles that are solid when the optimal density matches our lower bound, gσ, and hollow otherwise. Each column corresponds to a parameter being fixed to the lowest non-trivial value, i.e., σ = 2 in the first column, w =2 in the second column, and k =1 in the third column. Note that the x-axis in the third column corresponds to w, not k.
Figure 3.
Figure 3.
Comparison of existing schemes to lower bounds with practical parameters. Densities are calculated by applying each scheme to a random sequence of 10 million characters over an alphabet of size σ = 4 (dotted lines) and are compared with the corresponding proportion of sampled k-mers on the human Y chromosome (Rhie et al. 2023) (soft lines). The mod-minimizer uses parameter r =4, and miniception uses parameter max(4,kw). The window sizes 5 and 19 are the default window sizes for Kraken2 (Wood et al. 2019) and minimap2 (-ax hifi) (Li 2018), respectively. For SSHash, w =12 was the window size used when indexing the human genome (Pibiri 2022).

Update of

References

    1. DeBlasio D, Gbosibo F, Kingsford C. et al. Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Niagara Falls, NY, USA. New York, NY, USA: Association for Computing Machinery, 2019, 167–76. 10.1145/3307339.3342144 - DOI
    1. Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021;9:e10805. 10.7717/peerj.10805 - DOI - PMC - PubMed
    1. Ekim B, Berger B, Orenstein Y.. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In: International Conference on Research in Computational Molecular Biology, Padua, Italy. Cham, Switzerland: Springer International Publishing, 2020, 37–53. 10.1007/978-3-030-45257-5_3 - PMC - PubMed
    1. Golan S, Tziony I, Kraus M. et al. Generating low-density minimizers. bioRxiv, 2024. 10.1101/2024.10.28.620726, preprint: not peer reviewed. - DOI
    1. Groot Koerkamp R, Pibiri GE. The mod-minimizer: a simple and efficient sampling algorithm for long k-mers. In: Pissis SP, Sung WK (eds), 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl—Leibniz-Zentrum für Informatik, 2024, 11:1–11:23. ISBN 978-3-95977-340-9. 10.4230/LIPIcs.WABI.2024.11. - DOI

MeSH terms