minhash 算法需要多少个哈希函数 [英] How many hash functions are required in a minhash algorithm

查看:48
本文介绍了minhash 算法需要多少个哈希函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我热衷于尝试实施 minhashing 以查找几乎重复的内容.http://blog.cluster-text.com/tag/minhash/ 写得很好,但是还有一个问题是,您需要在文档中的带状疱疹上运行多少个散列算法才能获得合理的结果.

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but there the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.

上面的博文提到了诸如 200 种哈希算法之类的东西.http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx 列出 100 作为默认值.

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

显然随着散列数的增加,准确率会提高,但多少散列函数是合理的?

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

引自博客

在我们的相似性估计中很难得到误差条小于 [7%] 因为统计上的误差线采样值缩放——要将误差条减半,我们需要四个数倍的样本.

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

这是否意味着将哈希数减少到 12 (200/4/4) 会导致 28% (7 * 2 * 2) 的错误率?

Does this mean that mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

推荐答案

差不多......但 28% 将是错误估计",这意味着报告的测量值经常会出现 +/- 28% 的不准确.

Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

这意味着报告的 78% 的测量很容易来自仅 50% 的相似性.或者 50% 的相似度很容易被报告为 22%.对我来说,这听起来不够准确,无法满足业务预期.

That means that a reported measurement of 78% could easily come from only 50% similarity.. Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.

从数学上讲,如果您报告的是两位数,则第二位应该是有意义的.

Mathematically, if you're reporting two digits the second should be meaningful.

为什么要将哈希函数的数量减少到 12 个?200 个哈希函数"的真正含义是,为每个 shingle/string 计算一次质量不错的哈希码——然后应用 200 个便宜的 &快速转换,强调某些因素/将某些位放在前面.

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.

我建议结合按位旋转(或混洗)和异或运算.每个散列函数可以组合旋转一定数量的位,然后通过随机生成的整数进行异或运算.

I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combined rotation by some number of bits, then XORing by a randomly generated integer.

这既传播"了 min() 函数在位周围的选择性,也包括 min() 最终选择的值.

This both "spreads" the selectivity of the min() function around the bits, and as to what value min() ends up selecting for.

旋转的基本原理是,min(Int)"将在 256 次中的 255 次中仅选择 8 个最高有效位.只有当所有高位都相同时,低位才会在比较中产生任何影响..因此展开有助于避免过分强调木瓦中的一两个字符.

The rationale for rotation, is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same, do lower bits have any effect in the comparison.. so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

XOR 的基本原理是,按位旋转 (ROTR) 本身可以在 50% 的时间(当从左侧移入 0 位时)收敛到零,这会导致单独的"散列函数表现出一种不受欢迎的趋向于零的趋势 - 因此他们过度倾向于最终选择相同的带状疱疹,而不是独立的带状疱疹.

The rationale for XOR is that, on it's own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.

有符号整数有一个非常有趣的按位"怪癖,其中 MSB 为负,但所有后续位均为正,这使得 有符号整数的旋转收敛趋势不太明显 --对于未签名来说很明显.无论如何,在这些情况下仍必须使用 XOR.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.

Java 内置了 32 位哈希码.如果您使用 Google Guava 库,则可以使用 64 位哈希码.

Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.

感谢@BillDimm 的投入和;坚持指出 XOR 是必要的.

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

这篇关于minhash 算法需要多少个哈希函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆