Comparative Study

. 2019 Dec 4;20(1):265.

doi: 10.1186/s13059-019-1875-0.

Dashing: fast and accurate genomic distances with HyperLogLog

Daniel N Baker¹, Ben Langmead²

Affiliations

¹ Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA. dnb@cs.jhu.edu.
² Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA. langmea@cs.jhu.edu.

PMID: 31801633
PMCID: PMC6892282
DOI: 10.1186/s13059-019-1875-0

Comparative Study

Dashing: fast and accurate genomic distances with HyperLogLog

Daniel N Baker et al. Genome Biol. 2019.

. 2019 Dec 4;20(1):265.

doi: 10.1186/s13059-019-1875-0.

Authors

Daniel N Baker¹, Ben Langmead²

Affiliations

¹ Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA. dnb@cs.jhu.edu.
² Department of Computer Science, Johns Hopkins University, 3400 N Charles St, Baltimore, 21218, USA. langmea@cs.jhu.edu.

PMID: 31801633
PMCID: PMC6892282
DOI: 10.1186/s13059-019-1875-0

Abstract

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.

Keywords: Alignment; Genomic distance; Hyperloglog; Metagenomics; Sequencing; Sketch data structures.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Jaccard-coefficient estimation error using HLL and MinHash. Left column shows experiments with the true Jaccard coefficient fixed at 0.111. Right column shows the same for a coefficient of 0.0465. x axis shows the log2 of the size of the sketch data structure in bytes. y axis shows the absolute error of the JACCARD-COEFFICIENT estimate. The second row zooms further in with respect to the y-axis. Colors indicate the input set sizes, and each pair of inputs differs in size by a factor of 2³=8

**Fig. 2**
Estimated versus true Jaccard coefficients (Js) for various methods across a range of true J. Each point is one pair from an overall set of 400 pairs of genomes, selected to evenly cover the range of true Js

**Fig. 3**
Computational efficiency of Mash, BinDash and Dashing. Results for k=21, k=31 and sketches of size 2¹⁰ (1 kb), 2¹² (4 kb), 2¹⁴ (16 kb), and 2¹⁶ (64 kb). “Both” results obtained either by using a combined Sketch+Distance mode (for Dashing) or by combining results from separate sketching and distance-calculation invocations (for Mash and BinDash). Dashing was assessed using three estimation methods: Flajolet’s method using the harmonic mean (“Orig”) and Ertl’s MLE and JMLE methods

**Fig. 4**
a Relationship between maximum leading zero count (Max LZC) and set size for three randomly-generated sets of 8-bit numbers. The Max LZC roughly estimates the log2 of the set size, though with high variance; here, two of three estimates are off by 2-fold. b Schematic of HyperLogLog sketch. Input items are hashed and hash value is partitioned into prefix p and suffix q. p indexes into the array of HLL registers. A register contains the maximum leading zero count among all suffixes q that mapped there. Register-level estimates are then combined to obtain an overall cardinality estimate. c Estimating cardinalities of sets A and B, and d estimating the cardinality of their union. For intersection cardinalities using inclusion-exclusion principle, estimated set and union cardinalities are combined. e Direct estimation of intersection cardinality with Ertl’s JMLE

See this image and copyright information in PMC

References

1. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. - PMC - PubMed
1. Schaeffer L, Pimentel H, Bray N, Melsted P, Pachter L. Pseudoalignment for metagenomic read assignment. Bioinformatics. 2017;33(14):2082–8. - PMC - PubMed
1. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36. - PMC - PubMed
1. Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015;33(6):623–30. - PubMed
1. Jain C, Koren S, Dilthey A, Phillippy AM, Aluru S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics. 2018;34(17):748–56. - PMC - PubMed

Publication types

Actions
Actions
Actions
Actions

MeSH terms

Actions
Actions

Grants and funding

R01 GM118568/GM/NIGMS NIH HHS/United States

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Dashing: fast and accurate genomic distances with HyperLogLog

Affiliations

Dashing: fast and accurate genomic distances with HyperLogLog

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources