A near-tight lower bound on the density of forward sampling schemes

Bryce Kille¹, Ragnar Groot Koerkamp², Drake McAdams¹, Alan Liu¹, Todd J Treangen^{1

3}

Affiliations

¹ Department of Computer Science, Rice University, Houston, TX 77005, United States.
² Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.
³ Ken Kennedy Institute, Rice University, Houston, TX 77005, United States.

PMID: 39666942
PMCID: PMC11676336
DOI: 10.1093/bioinformatics/btae736

A near-tight lower bound on the density of forward sampling schemes

Bryce Kille et al. Bioinformatics. 2024.

. 2024 Dec 26;41(1):btae736.

doi: 10.1093/bioinformatics/btae736.

Authors

Bryce Kille¹, Ragnar Groot Koerkamp², Drake McAdams¹, Alan Liu¹, Todd J Treangen^{1

3}

Affiliations

¹ Department of Computer Science, Rice University, Houston, TX 77005, United States.
² Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.
³ Ken Kennedy Institute, Rice University, Houston, TX 77005, United States.

PMID: 39666942
PMCID: PMC11676336
DOI: 10.1093/bioinformatics/btae736

Abstract

Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two.

Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k≡1(mod w). For large w and k, the bound can be approximated by 1w+k⌈w+kw⌉. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al. is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k≡1(mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound.

Availability and implementation: Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers ILP and analysis: github.com/treangenlab/sampling-scheme-analysis.

PubMed Disclaimer

Figures

**Figure 1.**
(a) A De Bruijn graph B₃ corresponding to a minimum density $(w = 2, k = 2)$ -forward scheme. The underlined characters in each vertex represent the 2-mer that is selected for that window. The solid edges represent the charged contexts and the edge colors represent the pure cycles in B₄ (not in B₃ itself). For characters *c_i*, each edge $(c_{0} c_{1} c_{2}, c'_{0} c'_{1} c'_{2})$ in B₃ corresponds to the vertex $c_{0} c_{1} c_{2} c'_{2}$ in B₄. (b) The corresponding $(w = 2, ℓ = 4)$ -UHS in B₄. The vertices are partitioned by color, representing the pure-cycles. The 2-mer(s) selected in each context are underlined. The vertices with a double border represent the charged edges in B₃ in (a) and the corresponding (2, 4)-UHS. Each pure cycle c has at least $⌈ | c | / w ⌉$ vertices in the UHS.

**Figure 2.**
Comparison of forward scheme lower bounds and optimal densities for small w, k, and σ. Optimal densities were obtained via the ILP and are plotted as black circles that are solid when the optimal density matches our lower bound, $g'_{σ}$ , and hollow otherwise. Each column corresponds to a parameter being fixed to the lowest non-trivial value, i.e., σ = 2 in the first column, w = 2 in the second column, and k = 1 in the third column. Note that the x-axis in the third column corresponds to w, not k.

**Figure 3.**
Comparison of existing schemes to lower bounds with practical parameters. Densities are calculated by applying each scheme to a random sequence of 10 million characters over an alphabet of size σ = 4 (dotted lines) and are compared with the corresponding proportion of sampled k-mers on the human Y chromosome (Rhie *et al.* 2023) (soft lines). The mod-minimizer uses parameter r = 4, and miniception uses parameter $\max (4, k - w)$ . The window sizes 5 and 19 are the default window sizes for Kraken2 (Wood *et al.* 2019) and minimap2 (-ax hifi) (Li 2018), respectively. For SSHash, w = 12 was the window size used when indexing the human genome (Pibiri 2022).

See this image and copyright information in PMC

Update of

A near-tight lower bound on the density of forward sampling schemes.
Kille B, Koerkamp RG, McAdams D, Liu A, Treangen TJ. Kille B, et al. bioRxiv [Preprint]. 2024 Nov 19:2024.09.06.611668. doi: 10.1101/2024.09.06.611668. bioRxiv. 2024. Update in: Bioinformatics. 2024 Dec 26;41(1):btae736. doi: 10.1093/bioinformatics/btae736. PMID: 39605515 Free PMC article. Updated. Preprint.

References

1. DeBlasio D, Gbosibo F, Kingsford C. et al. Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Niagara Falls, NY, USA. New York, NY, USA: Association for Computing Machinery, 2019, 167–76. 10.1145/3307339.3342144 - DOI
1. Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021;9:e10805. 10.7717/peerj.10805 - DOI - PMC - PubMed
1. Ekim B, Berger B, Orenstein Y.. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In: International Conference on Research in Computational Molecular Biology, Padua, Italy. Cham, Switzerland: Springer International Publishing, 2020, 37–53. 10.1007/978-3-030-45257-5_3 - PMC - PubMed
1. Golan S, Tziony I, Kraus M. et al. Generating low-density minimizers. bioRxiv, 2024. 10.1101/2024.10.28.620726, preprint: not peer reviewed. - DOI
1. Groot Koerkamp R, Pibiri GE. The mod-minimizer: a simple and efficient sampling algorithm for long k-mers. In: Pissis SP, Sung WK (eds), 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl—Leibniz-Zentrum für Informatik, 2024, 11:1–11:23. ISBN 978-3-95977-340-9. 10.4230/LIPIcs.WABI.2024.11. - DOI

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A near-tight lower bound on the density of forward sampling schemes

Affiliations

A near-tight lower bound on the density of forward sampling schemes

Authors

Affiliations

Abstract

Figures

Update of

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous