Review

Genome Biol. 2024 Oct 14;25(1):270.

doi: 10.1186/s13059-024-03414-4.

When less is more: sketching with minimizers in genomics

Malick Ndiaye^#¹, Silvia Prieto-Baños^#² ³, Lucy M Fitzgerald^#², Ali Yazdizadeh Kharrazi², Sergey Oreshkov⁴, Christophe Dessimoz² ³, Fritz J Sedlazeck⁵, Natasha Glover² ³, Sina Majidian⁶ ⁷

Affiliations

¹ Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland.
² Department of Computational Biology, UNIL, Lausanne, Switzerland.
³ SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
⁴ Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland.
⁵ Baylor College of Medicine, Houston, USA.
⁶ Department of Computational Biology, UNIL, Lausanne, Switzerland. sina.majidian@unil.ch.
⁷ SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland. sina.majidian@unil.ch.

^# Contributed equally.

PMID: 39402664
PMCID: PMC11472564
DOI: 10.1186/s13059-024-03414-4

Malick Ndiaye et al. Genome Biol. 2024.

Abstract

The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.

Conflict of interest statement

FJS received support from PacBio, ONT, and Illumina. CD has been providing consulting services for Pacific Biosciences, Inc. All other authors declare that they have no competing interests.

Figures

**Fig. 1**
The concept and applications of minimizers. K-mers are fixed-length substrings of a sequence and are used to analyze genomic sequences. The minimizer approach reduces the computational requirements by selecting only a representative k-mer from a group of adjacent k-mers. Minimizers are useful in a diverse range of applications in bioinformatics and computational biology including read alignment, read correction, de Bruijn graph representation, genome assembly, pangenomics, metagenomics classification and assembly, and beyond

**Fig. 2**
Implementation of two minimizer schemes with differing w values (left and right) for two sequences with an exact match of length 8 shown in blue underline. The parameter k = 3 and the ordering (lexicographic) are constant. The length of each sequence is |S| = 11, giving 9 (= |S| − k + 1) k-mers. Each box represents the window with size w (6 or 8), corresponding to the starting positions of the window’s k-mers, which cover w + k − 1 bases (8 or 10, respectively). For sequence 1, using w = 6, the selected minimizer in the first window (partly covering the underlined exact match) is ACT starting at position 2. The same minimizer, ACT, is also selected for sequence 2 using w = 6. Since the exact match length is 8 (≥ w + k − 1), the first property of minimizer schemes is fulfilled, and the same minimizer is chosen for both sequences, representing the exact match. However, when using w = 8 (right), the match length is < w + k − 1. Thus, there is no guarantee of sharing a minimizer, and a different k-mer is chosen for each sequence in this example. Note that for the second window in sequence 2, we break the tie between the ACTs starting at positions 2 and 7 by taking the leftmost position; this happens for both w = 6 and w = 8. The density of the minimizer scheme for sequence 1 using w = 6 is 2/9, as two minimizers are chosen in total: ACT (position 2, for the first two windows) and ACC (position 7, for the last two windows), and the density for sequence 2 is also 2/9 using w = 6. With w = 8, the density for both sequences is 1/9.
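
The window-and-ordering procedure in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `minimizers` and `density` and the 11-base sequence are made up for the example (it is not one of the two sequences drawn in the figure). Positions are 0-based, the ordering is lexicographic, and ties go to the leftmost position, as the caption describes.

```python
from fractions import Fraction


def minimizers(seq, k, w):
    """Return the distinct (position, k-mer) minimizers of seq.

    Lexicographic ordering; the leftmost position wins ties.
    Positions are 0-based.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Each window spans w consecutive k-mer starts, i.e. w + k - 1 bases.
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def density(seq, k, w):
    """Fraction of k-mer positions that are selected as minimizers."""
    return Fraction(len(minimizers(seq, k, w)), len(seq) - k + 1)


# Hypothetical sequence with |S| = 11, so 9 k-mers for k = 3.
example = "GTACTGTACCG"
print(minimizers(example, k=3, w=6))  # [(2, 'ACT'), (7, 'ACC')]
print(density(example, k=3, w=6))     # 2/9
print(density(example, k=3, w=8))     # 1/9
```

Widening the window from 6 to 8 halves the number of selected k-mers for this sequence. That is the trade-off the caption points to: a lower density, but a longer exact match (w + k − 1 bases) is needed before a shared minimizer is guaranteed.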

**Fig. 3**
An example implementation of a minimizers scheme using a hash function for ordering. In this case, the hash function calculates the remainder of the values assigned to each k-mer divided by 13. The k-mer with the lowest hash value in a window is selected as the minimizer. For the last window, we break the tie between ⁷ACC and ⁹CTT with hash value of 5, by selecting the one starting at the leftmost position resulting in ⁷ACC
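
A hash-based ordering like the one in this caption can be sketched as follows. The caption does not say which value is assigned to each k-mer, so the base-4 encoding below (A = 0, C = 1, G = 2, T = 3) is an assumption; under it, ACC and CTT both give a remainder of 5 when divided by 13, in line with the tie mentioned above. The function names and the 7-base sequence are hypothetical and are not taken from the figure.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding, not from the source


def kmer_hash(kmer, modulus=13):
    """Read the k-mer as a base-4 number and take its remainder."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value % modulus


def window_minimizers(seq, k, w, modulus=13):
    """For each window, return (position, k-mer, hash) of its minimizer."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [kmer_hash(kmer, modulus) for kmer in kmers]
    result = []
    for start in range(len(kmers) - w + 1):
        # Smallest hash wins; the position in the key breaks ties leftmost.
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        result.append((best, kmers[best], hashes[best]))
    return result


print(kmer_hash("ACC"), kmer_hash("CTT"))  # 5 5
# Hypothetical sequence whose last window contains both ACC and CTT.
print(window_minimizers("TTACCTT", k=3, w=3))
# [(2, 'ACC', 5), (2, 'ACC', 5), (2, 'ACC', 5)]
```

In the last window the two k-mers tie on hash value 5, and the leftmost one (ACC) is kept, which mirrors the tie-breaking rule in the caption.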

**Fig. 4**
Application of minimizers in read alignment. A typical read aligner that follows the seed-chain-align approach first finds reference minimizers and stores them in a hash table. Seeds are substrings (minimizers) from the reference or the read. Seeds that match between the read and the reference are called anchors, which are found by querying the read minimizers in the hash table. Then, anchors are chained together and finally bases are aligned
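
The seeding stage of this seed-chain-align outline can be sketched as below. Everything here is a simplification under stated assumptions: the function names and sequences are hypothetical, the ordering is lexicographic, and the chaining step is replaced by a crude stand-in that keeps the anchors lying on the most populated diagonal (reference position minus read position). Base-level alignment is omitted, and the sketch does not reproduce any particular aligner.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Distinct (position, k-mer) minimizers; lexicographic, leftmost tie-break."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(reference, k, w):
    """Hash table mapping each reference minimizer to its positions."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def find_anchors(read, index, k, w):
    """Anchors are (read_pos, ref_pos) pairs whose minimizers match."""
    anchors = []
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            anchors.append((read_pos, ref_pos))
    return anchors


def best_diagonal(anchors):
    """Stand-in for chaining: keep anchors sharing the most common offset."""
    by_diagonal = defaultdict(list)
    for read_pos, ref_pos in anchors:
        by_diagonal[ref_pos - read_pos].append((read_pos, ref_pos))
    return max(by_diagonal.values(), key=len) if by_diagonal else []


reference = "GTACTGTACCG"   # hypothetical reference
read = reference[1:]        # hypothetical error-free read starting at offset 1
index = build_index(reference, k=3, w=4)
anchors = find_anchors(read, index, k=3, w=4)
print(anchors)                 # [(1, 2), (2, 3), (6, 7)]
print(best_diagonal(anchors))  # [(1, 2), (2, 3), (6, 7)]
```

All three anchors sit on the same diagonal (offset 1), which is where the read was taken from. A real chaining step would also have to tolerate gaps and spurious anchors before handing the candidate region to base-level alignment.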

**Fig. 5**
Implementation of minimizers in the construction and compaction of de Bruijn graphs (dBGs). Traditionally, a dBG is a directed graph where the edges are represented by all distinct k-mers extracted from the input reads. Nodes within this graph correspond to the *k-1* suffixes and prefixes of the k-mers which are connected by edges if they are in a k-mer. To optimize dBG construction, *MBG* and *ntJoin* employ minimizers as nodes, connecting adjacent minimizers with edges. Similarly, *LJA* incorporates “splits” as edges representing substrings between pairs of consecutive minimizers in the input reads. *rust-mdBG* utilizes tuples of k′ minimizers as nodes (k′ = 3 in this example), connecting nodes with overlaps of *k′-1*. Following graph construction, compaction is crucial for reducing dBG size for efficient memory storage. *BCALM2* and *Bifrost* leverage minimizers to parallelize graph compaction. *BCALM2* categorizes k-mers into disk-buckets based on suffix and prefix minimizers, while *Bifrost* adds k-mers to a blocked bloom filter according to the hash value of their minimizer. These data structures enable the parallel inference of maximal unitigs, enhancing the overall efficiency of the compaction process
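
The contrast between a classic dBG and a graph built over minimizers can be sketched as follows. This is a toy illustration of the general idea of using minimizers as nodes and linking adjacent ones; it is not how MBG, ntJoin, or any other tool named in the caption is implemented. The function names and the input read are hypothetical, and minimizers are chosen lexicographically with leftmost tie-breaking.

```python
def de_bruijn_edges(reads, k):
    """Classic dBG: each distinct k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return sorted(edges)


def minimizer_path(read, k, w):
    """Minimizer k-mers of a read, ordered by position."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add(best)
    return [kmers[i] for i in sorted(picked)]


def minimizer_graph_edges(reads, k, w):
    """Toy minimizer-space graph: nodes are minimizers,
    edges link minimizers that are adjacent along a read."""
    edges = set()
    for read in reads:
        path = minimizer_path(read, k, w)
        edges.update(zip(path, path[1:]))
    return sorted(edges)


reads = ["GTACTGTACCG"]  # hypothetical single read
print(len(de_bruijn_edges(reads, k=3)))        # 7
print(minimizer_graph_edges(reads, k=3, w=4))  # [('ACT', 'CTG'), ('CTG', 'ACC')]
```

For this one read the classic graph needs seven edges while the minimizer-based graph needs two, which gives a rough feel for why working over minimizers shrinks the structure.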
