Review
J Comput Biol. 2023 Dec;30(12):1251-1276.
doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.

Creating and Using Minimizer Sketches in Computational Genomics

Hongyu Zheng et al. J Comput Biol. 2023 Dec.

Abstract

Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to give readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.

Keywords: de Bruijn graphs; k-mer counting; minimizers; read mapping; sketching.

Conflict of interest statement

C.K. is a cofounder and CEO of Ocean Genomics, Inc. G.M. is VP of Software Development at Ocean Genomics, Inc. The other author has no conflicting financial interests.

Figures

FIG. 1.
Example minimizer sketch with w=5, k=3, and O being the lexicographical order. Left-hand side: the sequence S=AACGTCGATCCG at the top; each line below is a window of S with the selected k-mer in red. Right-hand side: the resulting sketch of S.
FIG. 2.
Comparing two minimizer sketches that are identical in w, k, and S, differing only in the lexicographical ordering O. Left: A<C<G<T, identical to Figure 1. Middle: T<G<C<A, with all other parameters unchanged. Right: the resulting sketches for both setups.
FIG. 3.
Example for calculating preservation, with setup identical to that in Figure 1. Left-hand side: The original sequence S, its windows, and selected minimizer k-mers. Right-hand side: The mutated sequence S (the single mutated base is marked in purple), its windows, and selected minimizer k-mers (those different from S are marked in purple). The preservation rate is 2/11.
FIG. 4.
Example of using a minimizer sketch for read mapping. The left-hand side presents the original sequence, the original read, and the set of potential mappings. The right-hand side presents their minimizer k-mers as colored blocks, and the set of potential mappings that have at least one minimizer match. The second mapping has two minimizer matches and is usually considered the highest-quality mapping.
FIG. 5.
Example of minimizer-assisted k-mer counting. For counting 5-mers, we use k0=2 and lexicographical order, implying w0=4. The top left shows the sequence S and its windows (5-mers) grouped into super-k-mers by shared minimizer. Super-k-mers are sent to buckets, and the results of 5-mer counting are shown in each bucket. The final tabulation is obtained by concatenation.
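
The construction described in the captions of Figures 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual definition in which each window of w consecutive k-mers contributes its smallest k-mer under the ordering O, and it assumes ties are broken by the leftmost position. The function name `minimizer_sketch` and the `alphabet` parameter (which encodes O as a character order) are hypothetical names introduced here.

```python
def minimizer_sketch(seq, w, k, alphabet="ACGT"):
    """Return the sorted (position, k-mer) pairs selected as minimizers.

    `alphabet` lists the characters from smallest to largest, so
    "ACGT" means A<C<G<T and "TGCA" means T<G<C<A.
    Assumption: ties within a window go to the leftmost k-mer.
    """
    rank = {base: i for i, base in enumerate(alphabet)}
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Lexicographic comparison under the chosen character order.
    keys = [tuple(rank[b] for b in kmer) for kmer in kmers]

    selected = set()
    # Each window holds w consecutive k-mers (w + k - 1 bases).
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (keys[i], i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


S = "AACGTCGATCCG"
print(minimizer_sketch(S, w=5, k=3))
# [(0, 'AAC'), (1, 'ACG'), (5, 'CGA'), (7, 'ATC')]
print(minimizer_sketch(S, w=5, k=3, alphabet="TGCA"))
# [(4, 'TCG'), (8, 'TCC')]
```

Under these assumptions, reversing the ordering shrinks the sketch of S from four selected k-mers to two, which illustrates how strongly the choice of O can affect the result.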
