Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2019 Dec 4;20(1):265.
doi: 10.1186/s13059-019-1875-0.

Dashing: fast and accurate genomic distances with HyperLogLog

Affiliations
Comparative Study

Dashing: fast and accurate genomic distances with HyperLogLog

Daniel N Baker et al. Genome Biol. .

Abstract

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.

Keywords: Alignment; Genomic distance; Hyperloglog; Metagenomics; Sequencing; Sketch data structures.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Jaccard-coefficient estimation error using HLL and MinHash. Left column shows experiments with the true Jaccard coefficient fixed at 0.111. Right column shows the same for a coefficient of 0.0465. x axis shows the log2 of the size of the sketch data structure in bytes. y axis shows the absolute error of the JACCARD-COEFFICIENT estimate. The second row zooms further in with respect to the y-axis. Colors indicate the input set sizes, and each pair of inputs differs in size by a factor of 23=8
Fig. 2
Fig. 2
Estimated versus true Jaccard coefficients (Js) for various methods across a range of true J. Each point is one pair from an overall set of 400 pairs of genomes, selected to evenly cover the range of true Js
Fig. 3
Fig. 3
Computational efficiency of Mash, BinDash and Dashing. Results for k=21, k=31 and sketches of size 210 (1 kb), 212 (4 kb), 214 (16 kb), and 216 (64 kb). “Both” results obtained either by using a combined Sketch+Distance mode (for Dashing) or by combining results from separate sketching and distance-calculation invocations (for Mash and BinDash). Dashing was assessed using three estimation methods: Flajolet’s method using the harmonic mean (“Orig”) and Ertl’s MLE and JMLE methods
Fig. 4
Fig. 4
a Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of three estimates are off by 2-fold. b Schematic of HyperLogLog sketch. Input items are hashed and hash value is partitioned into prefix p and suffix q. p indexes into the array of HLL registers. A register contains the maximum leading zero count among all suffixes q that mapped there. Register-level estimates are then combined to obtain an overall cardinality estimate. c Estimating cardinalities of sets A and B, and d estimating the cardinality of their union. For intersection cardinalities using inclusion-exclusion principle, estimated set and union cardinalities are combined. e Direct estimation of intersection cardinality with Ertl’s JMLE

References

    1. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. - PMC - PubMed
    1. Schaeffer L, Pimentel H, Bray N, Melsted P, Pachter L. Pseudoalignment for metagenomic read assignment. Bioinformatics. 2017;33(14):2082–8. - PMC - PubMed
    1. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36. - PMC - PubMed
    1. Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015;33(6):623–30. - PubMed
    1. Jain C, Koren S, Dilthey A, Phillippy AM, Aluru S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics. 2018;34(17):748–56. - PMC - PubMed

Publication types

LinkOut - more resources