LSH使用哈希混淆 [英] Confusion in hashing used by LSH

查看：159 发布时间：2018/6/1 15:33:24 python hash nearest-neighbor locality-sensitive-hash bigdata

本文介绍了LSH使用哈希混淆的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

M 是签名矩阵，它是通过Minhashing生成的实际数据，将文档作为列和单词作为行。所以一列代表一个文件。

现在它说每条数据（ b 数目， r 的长度）将其列分散，以便列落入一个桶中。如果两列落在同一个桶中，对于> = 1的条带，则它们可能是相似的。

这意味着我应该创建 b 散列表并找到 b 独立散列函数？或者只有一个就足够了，每一个条带都会将它的列发送到同一个桶的集合（但是不会取消这个条带）？

字典足以满足散列表在这种情况下^*？

* _{是一个Python字典的哈希表的例子吗？}

解决方案

我想到了它，发布给未来的读者。

我打算使用一个字典，因为幻灯片提到可以对每个条纹使用相同的散列函数（字典）。

每个存储桶都是我们字典的关键字。

插入时，文档（即属于在一个条带中）将被一个散列函数（我们将创建）传递，而结果应该是关键。这样我们的字典将被填充。

Matrix M is the signatures matrix, which is produced via Minhashing of the actual data, has documents as columns and words as rows. So a column represents a document.

Now it says that every stripe (b in number, r in length) has its columns hashed, so that a column falls in a bucket. If two columns fall in the same bucket, for >= 1 stripes, then they are potentially similar.

So that means that I should create b hashtables and find b independent hash functions? Or just one is enough and every stripe sends its columns to the same collections of buckets (but wouldn't this cancel the stripes)?

Would a dictionary be enough for a hashtable in this case^*?

*_{Is a Python dictionary an example of a hash table?}

解决方案

I think I figured it out, posting for future readers.

I am going to use one dictionary, since the slides mentioned that it's OK to use the same hash function for every stripe (dictionaries do that).

Every bucket will be a key for our dictionary.

On insertion, a document (i.e. a column which belongs in a stripe) will be passed by a hash function (which we will create) and the result should be a key. That way our dictionary will be populated.

这篇关于LSH使用哈希混淆的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

LSH使用哈希混淆 [英] Confusion in hashing used by LSH

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录关闭

LSH使用哈希混淆 [英] Confusion in hashing used by LSH

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭