Review
Genome Biol. 2024 Oct 14;25(1):270.
doi: 10.1186/s13059-024-03414-4.

When less is more: sketching with minimizers in genomics

Malick Ndiaye et al. Genome Biol. 2024.

Abstract

The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.

Conflict of interest statement

FJS received support from PacBio, ONT, and Illumina. CD has been providing consulting services for Pacific Biosciences, Inc. All other authors declare that they have no competing interests.

Figures

Fig. 1
The concept and applications of minimizers. K-mers are fixed-length substrings of a sequence and are used to analyze genomic sequences. The minimizer approach reduces the computational requirements by selecting only a representative k-mer from a group of adjacent k-mers. Minimizers are useful in a diverse range of applications in bioinformatics and computational biology including read alignment, read correction, de Bruijn graph representation, genome assembly, pangenomics, metagenomics classification and assembly, and beyond.
Fig. 2
Implementation of two minimizer schemes with differing w values (left and right) for two sequences with an exact match of length 8 shown in blue underline. The parameter k = 3 and the ordering (lexicographic) are constant. The length of each sequence is |S| = 11, giving 9 (= |S| − k + 1) k-mers. Each box represents a window of size w (6 or 8), corresponding to the starting positions of the window’s k-mers, which cover w + k − 1 bases (8 or 10, respectively). For sequence 1, using w = 6, the selected minimizer in the first window (partly covering the underlined exact match) is ACT starting at position 2. The same minimizer, ACT, is also selected for sequence 2 using w = 6. Since the exact match length is 8 (≥ w + k − 1), the first property of minimizer schemes is fulfilled, and the same minimizer is chosen for both sequences, representing the exact match. However, when using w = 8 (right), the match length is < w + k − 1. Thus, there is no guarantee of sharing a minimizer, and a different k-mer is chosen for each sequence in this example. Note that for the second window in sequence 2, we break the tie between the ACTs starting at positions 2 and 7 by choosing the leftmost position; this happens for both w = 6 and w = 8. The density of the minimizer scheme for sequence 1 using w = 6 is 2/9, as two minimizers are chosen in total: ACT (position 2, for the first two windows) and ACC (position 7, for the last two windows), and the density for sequence 2 is also 2/9 using w = 6. With w = 8, the density for both sequences is 1/9.
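
The window scheme described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two sequences below are hypothetical stand-ins (the sequences drawn in the figure are not reproduced in the caption), positions are 0-based by assumption, and the function names `kmers`, `minimizers`, and `density` are invented for this sketch. The stand-in sequences share an exact match of 9 bases rather than 8.

```python
from fractions import Fraction


def kmers(seq, k):
    """All k-mers of seq, in order of their 0-based starting positions."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Distinct (position, k-mer) pairs picked by sliding a window of w
    consecutive k-mers over seq, using lexicographic order and the
    leftmost position to break ties."""
    km = kmers(seq, k)
    picked = []
    for start in range(len(km) - w + 1):
        best = min(range(start, start + w), key=lambda i: (km[i], i))
        # Selected positions never move left from one window to the next,
        # so comparing with the previous pick is enough to skip repeats.
        if not picked or picked[-1][0] != best:
            picked.append((best, km[best]))
    return picked


def density(seq, k, w):
    """Fraction of all k-mers that are selected as minimizers."""
    return Fraction(len(minimizers(seq, k, w)), len(kmers(seq, k)))


# Hypothetical sequences with |S| = 11, hence 9 k-mers for k = 3.
seq1 = "GTACTGGACCT"
seq2 = "GTACTGGACTT"  # ACT occurs twice, at positions 2 and 7

for w in (6, 8):
    print(w, minimizers(seq1, 3, w), minimizers(seq2, 3, w),
          density(seq1, 3, w))
```

With w = 6 (w + k − 1 = 8 ≤ 9) both stand-in sequences select ACT at position 2, and the density for the first sequence is 2/9. With w = 8 (w + k − 1 = 10 > 9) the guarantee no longer applies: the first sequence selects ACC and the second selects ACT, and the density drops to 1/9.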
Fig. 3
An example implementation of a minimizer scheme using a hash function for ordering. In this case, the hash function calculates the remainder of the value assigned to each k-mer divided by 13. The k-mer with the lowest hash value in a window is selected as the minimizer. For the last window, we break the tie between ACC (position 7) and CTT (position 9), which both have a hash value of 5, by selecting the one starting at the leftmost position, resulting in ACC at position 7.
Fig. 4
Application of minimizers in read alignment. A typical read aligner that follows the seed-chain-align approach first finds reference minimizers and stores them in a hash table. Seeds are substrings (minimizers) from the reference or the read. Seeds that match between the read and the reference are called anchors, which are found by querying the read minimizers in the hash table. Then, anchors are chained together and finally bases are aligned.
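
The seeding and chaining steps in this caption can be illustrated with a small sketch. It is a simplified assumption-laden example, not the method of any particular aligner: the ordering is lexicographic, the chaining step is a plain longest colinear chain without gap costs, the final base-level alignment is omitted, and the sequences and function names are hypothetical.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Distinct (position, k-mer) window minimizers, lexicographic order,
    leftmost position on ties, 0-based positions."""
    km = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(km) - w + 1):
        best = min(range(start, start + w), key=lambda i: (km[i], i))
        if not picked or picked[-1][0] != best:
            picked.append((best, km[best]))
    return picked


def build_index(reference, k, w):
    """Hash table mapping each reference minimizer to its positions."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def find_anchors(read, index, k, w):
    """Anchors are (read_pos, ref_pos) pairs of matching minimizers."""
    anchors = []
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            anchors.append((read_pos, ref_pos))
    return sorted(anchors)


def chain(anchors):
    """Longest chain of anchors increasing in both coordinates
    (quadratic dynamic programming, no gap penalties)."""
    if not anchors:
        return []
    length = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for j, (rj, fj) in enumerate(anchors):
        for i, (ri, fi) in enumerate(anchors[:j]):
            if ri < rj and fi < fj and length[i] + 1 > length[j]:
                length[j], prev[j] = length[i] + 1, i
    end = max(range(len(anchors)), key=lambda j: length[j])
    out = []
    while end != -1:
        out.append(anchors[end])
        end = prev[end]
    return out[::-1]


reference = "GATTACAGGCTCAACGT"  # hypothetical reference
read = reference[4:14]           # error-free read taken from offset 4
index = build_index(reference, k=3, w=4)
print(chain(find_anchors(read, index, k=3, w=4)))
# prints [(0, 4), (2, 6), (5, 9), (7, 11)]
```

Every anchor in the chain lies on the same diagonal (reference position minus read position equals 4), which is the offset the read was sampled from. A real aligner would then align the bases between consecutive anchors.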
Fig. 5
Implementation of minimizers in the construction and compaction of de Bruijn graphs (dBGs). Traditionally, a dBG is a directed graph where the edges are represented by all distinct k-mers extracted from the input reads. Nodes within this graph correspond to the (k − 1)-mer suffixes and prefixes of the k-mers, which are connected by an edge if they occur together in a k-mer. To optimize dBG construction, MBG and ntJoin employ minimizers as nodes, connecting adjacent minimizers with edges. Similarly, LJA incorporates “splits” as edges representing substrings between pairs of consecutive minimizers in the input reads. rust-mdBG utilizes tuples of k′ minimizers as nodes (k′ = 3 in this example), connecting nodes with overlaps of k′ − 1. Following graph construction, compaction is crucial for reducing dBG size for efficient memory storage. BCALM2 and Bifrost leverage minimizers to parallelize graph compaction. BCALM2 categorizes k-mers into disk-buckets based on suffix and prefix minimizers, while Bifrost adds k-mers to a blocked Bloom filter according to the hash value of their minimizer. These data structures enable the parallel inference of maximal unitigs, enhancing the overall efficiency of the compaction process.
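
As a rough sketch of the two graph constructions contrasted in this caption, the code below builds a traditional edge-centric dBG and a graph whose nodes are minimizers. The second function only illustrates the general idea of connecting adjacent minimizers; it is not the implementation used by MBG, ntJoin, or any other tool named here. The reads, parameter values, lexicographic ordering, and function names are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Traditional dBG: each distinct k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = {read[i:i + k] for read in reads
             for i in range(len(read) - k + 1)}
    graph = defaultdict(set)
    for kmer in edges:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def minimizers(seq, k, w):
    """Distinct window minimizers (lexicographic order, leftmost on ties),
    returned in order of position."""
    km = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(km) - w + 1):
        best = min(range(start, start + w), key=lambda i: (km[i], i))
        if not picked or picked[-1][0] != best:
            picked.append((best, km[best]))
    return [kmer for _, kmer in picked]


def minimizer_graph(reads, k, w):
    """Graph with minimizers as nodes and an edge between each pair of
    minimizers that are adjacent in some read."""
    graph = defaultdict(set)
    for read in reads:
        mins = minimizers(read, k, w)
        for left, right in zip(mins, mins[1:]):
            graph[left].add(right)
    return graph


reads = ["ACGTC", "CGTCA"]  # hypothetical overlapping reads
print(dict(de_bruijn(reads, k=3)))
print(dict(minimizer_graph(["GTACTGGACCT"], k=3, w=6)))
```

The first graph has four edges (ACG, CGT, GTC, TCA) over five (k − 1)-mer nodes, while the minimizer graph for the 11-base read has a single edge from ACT to ACC, which hints at why working in minimizer space reduces graph size.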
