LSH使用哈希混淆 [英] Confusion in hashing used by LSH

查看:159
本文介绍了LSH使用哈希混淆的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



M 是签名矩阵,它是通过Minhashing生成的实际数据,将文档作为列和单词作为行。所以一列代表一个文件。



现在它说每条数据( b 数目, r 的长度)将其列分散,以便列落入一个桶中。如果两列落在同一个桶中,对于> = 1的条带,则它们可能是相似的。



这意味着我应该创建 b 散列表并找到 b 独立散列函数?或者只有一个就足够了,每一个条带都会将它的列发送到同一个桶的集合(但是不会取消这个条带)?

字典足以满足散列表在这种情况下 *



* 是一个Python字典的哈希表的例子吗?

解决方案

我想到了它,发布给未来的读者。

我打算使用一个字典,因为幻灯片提到可以对每个条纹使用相同的散列函数(字典)。



每个存储桶都是我们字典的关键字。



插入时,文档(即属于在一个条带中)将被一个散列函数(我们将创建)传递,而结果应该是关键。这样我们的字典将被填充。


Matrix M is the signatures matrix, which is produced via Minhashing of the actual data, has documents as columns and words as rows. So a column represents a document.

Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls in a bucket. If two columns fall in the same bucket, for >= 1 stripes, then they are potentially similar.

So that means that I should create b hashtables and find b independent hash functions? Or just one is enough and every stripe sends its columns to the same collections of buckets (but wouldn't this cancel the stripes)?

Would a dictionary be enough for a hashtable in this case*?

*Is a Python dictionary an example of a hash table?

解决方案

I think I figured it out, posting for future readers.

I am going to use one dictionary, since the slides mentioned that it's OK to use the same hash function for every stripe (dictionaries do that).

Every bucket will be a key for our dictionary.

On insertion, a document (i.e. a column which belongs in a stripe) will be passed by a hash function (which we will create) and the result should be a key. That way our dictionary will be populated.

这篇关于LSH使用哈希混淆的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆