局部敏感散列(LSH)如何工作? [英] How Locality Sensitive Hashing (LSH) works?

查看:167
本文介绍了局部敏感散列(LSH)如何工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经阅读过这个问题,但不幸的是,

我不明白的是当我们了解将哪个桶分配给我们的高维空间查询向量 q :假设使用我们的局部敏感族函数 h_1,h_2,...,h_n ,我们翻译了 q 到一个低维( n 维度)散列码 c



然后 c 是分配给 q 的桶的索引,在哪里(希望)也被指定为最近的邻居,假设有100个向量。

现在,我们要做的是为了找到 q 的NN是计算 q 这100个向量之间的距离,是正确的吗?因此,使用 c 仅用于索引(它仅用于决定哪个存储区分配 q 来),右?




另一种解决方案,如这个调查(第2.2节),哈希表查找(前面描述的方法)的替代方法是快速近似,所以我们进行详尽的研究,计算 c 与生成的哈希码之间的距离,该哈希码相对于数据集中的每个元素。这应该是快速的,因为散列码处于低维空间中,并且计算距离应该更快(例如,如果哈希码空间是二进制的,那么我们可以使用XOR运算符来快速计算汉明两个哈希码之间的距离)。

现在我想知道的是:这两种方法的优缺点是什么?为什么我应该使用一种方法而不是另一种? 解决方案

第一种描述的方法解释了近似最近邻居搜索。是的,只要检查bin c 中的其他100个项目即可获得最佳性能,但是在其他邻近存储桶中缺少优秀候选人的风险更高。

经纬度坐标的简单哈希方案是地理散列。您可以通过查看同一Geohash区块内的物品来找到最近的商店,但在网格边界附近可能会得到不准确的结果。



快速距离近似的一个示例可以找到这里,它发现图像与足够小的汉明距离匹配(利用 pHash )。由于散列长度仅为64位,笔记本电脑GPU可以进行7亿次比较/秒。请注意,所有图像都被检查,不使用索引或哈希数据结构。这样你的年龄保证可以得到所有的比赛(在pHash空间)。

I've read already this question, but unfortunately it didn't help.

What I don't understand is what we do once we understood which bucket assign to our high-dimensional space query vector q: suppose that using our set of locality sensitive family functions h_1,h_2,...,h_n we have translated q to a low-dimension (n dimensions) hash code c.

Then c is the index of the bucket which q is assigned to and where (hopefully) are assigned also its nearest neighbors, let say that there are 100 vectors.

Now, what we do in order to find the q's NN is to compute the distance between q and only these 100 vectors, is that correct? So the use of c is only for indexing (it's used only for deciding which bucket assign q to`), right?


Another solution, as described in this survey (section 2.2), an alternative to "hash table lookup" (the previously described approach) is "fast distance approximation", so we perform an exhaustive research where we compute the distance between the c and the generated hash code relative to each element in the dataset. This is supposed to be fast since the hash code are in a low-dimension space AND the distance is supposed to be faster to compute (for example if the hash code space is binary, then we can use the XOR operator for quickly computing the hamming distance between two hash codes).

Now, what I wonder is: what are the advantages/disadvantages of the two methods? Why should I use one approach instead of the other one?

解决方案

The first described method explains an approximate nearest neighbors search. Yes you'd get the best performance by just checking those 100 other items in the bin c, but you've got a higher risk on missing good candidates in other neighboring buckets.

A simple hashing scheme for lat/lon coordinates is the Geohash. You could find nearest shop by looking at items within the same Geohash block, but inaccurate results are possible near grid boundaries.

One example of fast distance approximation can be found here, it finds image matches with sufficiently small Hamming distance (utilizing pHash). As hashes are only 64-bits long a laptop GPU could do 700 million comparisons / second. Note that all images are checked, no indexing or hashing data structure is used. This way you age guaranteed to get all matches (in pHash space).

这篇关于局部敏感散列(LSH)如何工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆