HyperMinHash: MinHash in LogLog space
- PMID: 38288326
- PMCID: PMC10824537
- DOI: 10.1109/tkde.2020.2981311
HyperMinHash: MinHash in LogLog space
Abstract
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size to buckets of size by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error on a Jaccard index , given a random oracle, HyperMinHash needs space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of with the same memory consumption.
Keywords: compression; hyperloglog; min-wise hashing; sketching; streaming.
Figures
References
-
- Broder AZ, “On the resemblance and containment of documents,” in Compression and Complexity of Sequences 1997. Proceedings. IEEE, 1997, pp. 21–29.
-
- Li P and Church KW, “Using sketches to estimate associations,” in Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005, pp. 708–715.
-
- Jaccard P, “Lois de distribution florale dans la zone alpine,” Bull Soc Vaudoise Sci Nat, vol. 38, pp. 69–130, 1902.
-
- Bar-Yossef Z, Jayram T, Kumar R, Sivakumar D, and Trevisan L, “Counting distinct elements in a data stream,” in International Workshop on Randomization and Approximation Techniques in Computer Science. Springer, 2002, pp. 1–10.
-
- Durand M and Flajolet P, “Loglog counting of large cardinalities,” in European Symposium on Algorithms. Springer, 2003, pp. 605–617.
Grants and funding
LinkOut - more resources
Full Text Sources
Other Literature Sources
Miscellaneous