Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Review
. 2019 Sep 13;20(1):199.
doi: 10.1186/s13059-019-1809-x.

When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Affiliations
Review

When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Will P M Rowe. Genome Biol. .

Abstract

Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching .

PubMed Disclaimer

Conflict of interest statement

The author declares that he has no competing interests.

Figures

Fig. 1
Fig. 1
a Sketching applied to a genomic data stream. The genomic data stream is viewed via a window; the window size may be equivalent to the length of a sequence read, a genome or some other arbitrary length. The sequence within the window is decomposed into a set of constituent k-mers; each k-mer can be evaluated against its reverse complement to keep only the canonical k-mer. As k-mers are generated, they are sketched and the sketch data structure may be updated. The sketch can be evaluated and allow feedback to the data stream process. b Common sketching algorithms applied to a single k-mer from a set, using example parameters. MinHash KHF: the k-mer is hashed by three functions, giving three values (green, blue, purple). The number of hash functions corresponds to the length of the sketch. Each value is evaluated against the corresponding position in the sketch; i.e. green compared against the first value, blue against the second, and purple against the third. The sketch updates with any new minimum; e.g. the blue value is smaller than the existing one in this position (3 < 66), so replaces it. Bloom filter: the k-mer is hashed by two functions; giving two values (red and orange). The output range of the hash functions corresponds to the length of the sketch, here 0–3. The hash values are used to set bits to 1 at the corresponding positions. CountMin sketch: the k-mer is hashed by two functions; giving two values (red and brown). The number of functions corresponds to a row in the sketch, here 0 or 1, and the output range of the functions corresponds to the length of the rows, here 0–2. So the first hash value (red) gives matrix position 0,0 and the second gives 1,1. The counters held at these positions in the matrix are incremented. HyperLogLog: the k-mer is hashed by one function; giving a single value (10011). The prefix (brown) corresponds to a register, and the suffix (blue) corresponds to the bit-pattern observable. The suffix is compared to the existing value in register 1, is found to have more leading zeros and so replaces the existing value in register 1

Similar articles

Cited by

References

    1. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x. - DOI - PMC - PubMed
    1. Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. bioRxiv. 2019:501726. - PMC - PubMed
    1. Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. These are not the K-mers you are looking for: efficient online K-mer counting using a probabilistic data structure. PLoS One. 2014;9:e101271. doi: 10.1371/journal.pone.0101271. - DOI - PMC - PubMed
    1. Rowe WPM, Carrieri AP, Alcon-Giner C, Caim S, Shaw A, Sim K, et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019;7:40. doi: 10.1186/s40168-019-0653-2. - DOI - PMC - PubMed
    1. Cormode G. Data sketching. Commun ACM. 2017;60:48–55. doi: 10.1145/3080008. - DOI

Publication types

LinkOut - more resources