Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Jan;34(1):328-339.
doi: 10.1109/tkde.2020.2981311. Epub 2020 Mar 17.

HyperMinHash: MinHash in LogLog space

Affiliations

HyperMinHash: MinHash in LogLog space

Yun William Yu et al. IEEE Trans Knowl Data Eng. 2022 Jan.

Abstract

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(logn) to buckets of size O(loglogn) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ϵ on a Jaccard index t, given a random oracle, HyperMinHash needs O(ϵ-2(loglogn+log1ϵ)) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 1019 with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 1010 with the same memory consumption.

Keywords: compression; hyperloglog; min-wise hashing; sketching; streaming.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
HyperMinHash generates sketches in the same fashion as one-permutation k-partition MinHash, but stores the hashes in floating-point notation. It begins by hashing each object in the set to a uniform random number between 0 and 1 , encoded in binary. Then, the hashed values are partitioned by the first p bits (above, 2 bits, in green), and the minimum value within each partition is taken. For ordinary MinHash, the procedure stops here, and a fixed number of bits (above, 64 bits) after the first p bits are taken as the hash value. For HyperMinHash, each value is further lossily compressed as an exponent (blue) and a mantissa (red); the exponent is the position of the leftmost 1 bit in the next 2q bits, and 2q+1 otherwise (above, q=6). The mantissa is simply the value of the next r bits in the bit-string (above, r=4).
Fig. 2.
Fig. 2.
HyperLogLog sections, used alone, result in collisions whenever the minimum hashes match in order of magnitude.
Fig. 3.
Fig. 3.
HyperMinHash further subdivides HyperLogLog leading 1indicator buckets, achieving a much smaller collision space, so long as we precisely store the position of the leading 1.
Fig. 4.
Fig. 4.
In practice, HyperMinHash has a limited number of bits for the LogLog counters, so there is a final lower left bucket at the precision limit.
Fig. 5.
Fig. 5.
We will upper bound the collision probability of HyperMinHash by dividing it into these four regions of integration: (a) the Top Right orange box, (b), the magenta ray covering intermediate boxes, (c) the black strip covering all but the final purple box, and (d) the final purple sub-bucket by the origin.
Fig. 6.
Fig. 6.. For a fixed size sketch, HyperMinHash has better accuracy and/or cardinality range than MinHash.
We compare Jaccard index estimation for identically sized sets with Jaccard index of 0.1, 1/3, 0.5, and 0.9, and plot the mean relative errors without estimated collision correction. All datasets represent between 8 and 96 random repetitions, as needed to get reasonable 95% confidence interval (shaded bands). (blue) A 64 byte HyperMinHash sketch, with 64 buckets of 8 bits each, 4 bits of which are allocated to the LogLog counter. Jaccard index estimation remains stable until cardinalities around 221. (orange) A 64 byte MinHash sketch with 64 buckets of 8 bits each achieves similar accuracy at low cardinalities, but fails once cardinalities approach 213. (green) A 64 byte MinHash sketch with 32 buckets of 16 bits can access larger cardinalities of around 219, but to do so trades off on low-cardinality accuracy.

References

    1. Broder AZ, “On the resemblance and containment of documents,” in Compression and Complexity of Sequences 1997. Proceedings. IEEE, 1997, pp. 21–29.
    1. Li P and Church KW, “Using sketches to estimate associations,” in Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005, pp. 708–715.
    1. Jaccard P, “Lois de distribution florale dans la zone alpine,” Bull Soc Vaudoise Sci Nat, vol. 38, pp. 69–130, 1902.
    1. Bar-Yossef Z, Jayram T, Kumar R, Sivakumar D, and Trevisan L, “Counting distinct elements in a data stream,” in International Workshop on Randomization and Approximation Techniques in Computer Science. Springer, 2002, pp. 1–10.
    1. Durand M and Flajolet P, “Loglog counting of large cardinalities,” in European Symposium on Algorithms. Springer, 2003, pp. 605–617.

LinkOut - more resources