多少散列函数都需要在一个minhash算法 [英] How many hash functions are required in a minhash algorithm

查看:222
本文介绍了多少散列函数都需要在一个minhash算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是热衷于尝试和实施minhashing附近找到重复的内容。 http://blog.cluster-text.com/tag/minhash/ 有一个很好的写了起来,但在那里,你有多少哈希算法需要整个带状疱疹文档中运行得到合理的结果的问题。

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but there the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.

博客文章上面提到的像200散列算法。 <一href="http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx">http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx列出100作为默认值。

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

显然有增加了准确性的哈希值数量的增加,但究竟有多少散列函数是合理的?

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

要在博客引用

这是很难得到的错误栏上我们相似的估计多   比[7%]因为统计的方法误差线的小   采样值规模 - 削减了一半的差错吧,我们就需要四   倍,许多样品

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

这是否意味着意味着减少哈希的数量为类似于12(200/4/4)将导致28%的误差率(7 * 2 * 2)?

Does this mean that mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

推荐答案

pretty的多..但28%的人是错误估计,意思是报告的测量经常会不准确通过+/- 28%。

Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

这意味着,78%报告的测量可以很容易地仅仅来自50%的相似性.. 或50%的相似性可以很容易地被报告为22%。听起来不够准确的商业预期,给我。

That means that a reported measurement of 78% could easily come from only 50% similarity.. Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.

在数学上,如果您报告两个数字,第二应该是有意义的。

Mathematically, if you're reporting two digits the second should be meaningful.

为什么要减少哈希函数的数量为12?什么200散列函数的真正含义是,计算出一个像样的品质散列code为每瓦/串一次 - 再申请200便宜和放大器;快速的转换,强调某些因素/带来一定的位向前方。

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.

我推荐相结合的按位轮换(或改组)和 XOR运算。每个散列函数可以由位的一些数目结合的旋转,然后由一个随机生成的整数异或

I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combined rotation by some number of bits, then XORing by a randomly generated integer.

这两个S $ P $垫周围的位分()函数的选择性,至于什么值MIN()最终选择的。

This both "spreads" the selectivity of the min() function around the bits, and as to what value min() ends up selecting for.

旋转的理由是,分(智力),将255次出256,选择只在8最显著位。只有当所有的顶部位是一样的,做低位有比较任何影响。所以小号preading可能是有用的,以避免过分强调一个或两个字符的木瓦。

The rationale for rotation, is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same, do lower bits have any effect in the comparison.. so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

有关XOR的理由是,在它自己的,按位旋转(ROTR)可能的时间(当0位从左移)50%向零收敛,这将导致独立的哈希函数显示的不良倾向一致趋向于零在一起 - 从而过度倾向,为他们最终选择相同瓦,不是独立的带状疱疹

The rationale for XOR is that, on it's own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.

有一个非常有趣的按位怪癖符号整数,其中MSB是负的,但是所有的下列位是积极的,呈现旋转的倾向收敛于的符号整数不太明显的 - 哪里会为明显的签名的。 XOR仍然必须在这些情况下使用,反正。

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.

Java有32位散列codeS内置。如果你使用谷歌番石榴库中,有64位散列codeS可用。

Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.

感谢@BillDimm他的输入和放大器;持续性地指出,XOR是必要的。

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

这篇关于多少散列函数都需要在一个minhash算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆