LSH使用哈希混淆 [英] Confusion in hashing used by LSH
问题描述
现在它说每条数据( b
数目, r
的长度)将其列分散,以便列落入一个桶中。如果两列落在同一个桶中,对于> = 1的条带,则它们可能是相似的。
这意味着我应该创建 b
散列表并找到 b
独立散列函数?或者只有一个就足够了,每一个条带都会将它的列发送到同一个桶的集合(但是不会取消这个条带)?
字典足以满足散列表在这种情况下 * ?
我想到了它,发布给未来的读者。
我打算使用一个字典,因为幻灯片提到可以对每个条纹使用相同的散列函数(字典)。
每个存储桶都是我们字典的关键字。
插入时,文档(即属于在一个条带中)将被一个散列函数(我们将创建)传递,而结果应该是关键。这样我们的字典将被填充。
Matrix M
is the signatures matrix, which is produced via Minhashing of the actual data, has documents as columns and words as rows. So a column represents a document.
Now it says that every stripe (b
in number, r
in length) has its columns hashed, so that a column falls in a bucket. If two columns fall in the same bucket, for >= 1 stripes, then they are potentially similar.
So that means that I should create b
hashtables and find b
independent hash functions? Or just one is enough and every stripe sends its columns to the same collections of buckets (but wouldn't this cancel the stripes)?
Would a dictionary be enough for a hashtable in this case*?
*Is a Python dictionary an example of a hash table?
I think I figured it out, posting for future readers.
I am going to use one dictionary, since the slides mentioned that it's OK to use the same hash function for every stripe (dictionaries do that).
Every bucket will be a key for our dictionary.
On insertion, a document (i.e. a column which belongs in a stripe) will be passed by a hash function (which we will create) and the result should be a key. That way our dictionary will be populated.
这篇关于LSH使用哈希混淆的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!