Search in locality sensitive hashing

Problem description

I'm trying to understand Section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper:

Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+epsilon))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O_σ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in O_σ is computed relative to q with its bits permuted by σ.) We examine 2N = O(n^(1/(1+epsilon))) bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q.

I'm confused by this algorithm and I don't think I understand how it works.

I already found this question on the topic, but I didn't understand the answer in the comments. Also, in this question the same algorithm is described in point 2, but again, I don't understand how it works.

Can you please try to explain to me how it works step by step, as simply as possible?

I even tried to make a list of the things that I don't understand, but in practice it is so badly written that I don't understand most of the sentences!

EDIT after gsamaras' answer:

I mostly understood the answer, but I still have some doubts:

  1. Is it correct to say that the total cost of performing the N permutations is O(N n log n), since we have to sort each one of them?

  2. Is the permutation+sorting process described above performed only once during pre-processing, or for every query q? O(N n log n) already seems pretty expensive in pre-processing; if we have to do this at query time it's a disaster :D

  3. At the last point, where we compare v0 and v4 to q, do we compare their permuted versions or the original ones (before the permutation)?

Solution

This question is somewhat broad, so I am just going to give a minimal (abstract) example here:

We have 6 (= n) vectors in our dataset, each with d bits. Let's assume that we do 2 (= N) random permutations.

Let the 1st random permutation begin! Remember that we permute the bits, not the order of the vectors. After permuting the bits, the vectors fall into a (lexicographic) order, for example:

v1
v5
v0
v3
v2
v4
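
For concreteness, here is a minimal Python sketch of how one such sorted order could be built; the sizes, data and helper names are made up for illustration, not taken from the paper:

import random

n, d = 6, 8                                # toy sizes: 6 vectors with 8 bits each
random.seed(0)
dataset = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)]

sigma = list(range(d))
random.shuffle(sigma)                      # one random permutation of the bit positions

def permute(bits, sigma):
    # reorder the bits of one vector according to sigma
    return tuple(bits[j] for j in sigma)

# The sorted order O_sigma: vector indices sorted lexicographically by their
# permuted bits (we sort indices, so the original vectors stay untouched).
order = sorted(range(n), key=lambda i: permute(dataset[i], sigma))
print(order)                               # some ordering of 0..5, analogous to v1, v5, v0, v3, v2, v4 above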

Now the query vector, q, arrives, but it is (almost certainly) not going to be identical to any vector in our dataset (after the permutation), thus we won't find it by performing a binary search.

However, we are going to end up between two vectors. So now we can imagine the scenario to be like this (for example, q lies between v0 and v3):

v1
v5
v0 <-- up pointer
   <-- q lies here
v3 <-- down pointer
v2
v4

Now we move either the up or the down pointer, looking for the vector vi that matches q in the most bits. Let's say it was v0.

Similarly, we do the second permutation and find its vector vi, let's say v4. We now compare v0 from the first permutation and v4, to see which one is closest to q, i.e. which one has the most bits equal to q.
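
Putting the whole procedure together, here is a minimal, self-contained Python sketch of the steps above; all names, sizes and data are illustrative assumptions, not the paper's reference implementation:

import random
from bisect import bisect_left

random.seed(1)
n, d, N = 6, 16, 2                          # toy sizes: 6 vectors, 16 bits, 2 permutations

def rand_vec(d):
    return tuple(random.randint(0, 1) for _ in range(d))

def permute(bits, sigma):
    return tuple(bits[j] for j in sigma)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def prefix_len(a, b):
    # length of the longest common prefix of two bit tuples
    k = 0
    while k < len(a) and a[k] == b[k]:
        k += 1
    return k

dataset = [rand_vec(d) for _ in range(n)]
q = rand_vec(d)

# Pre-processing (done once): N random permutations, each with its sorted order.
sorted_orders = []
for _ in range(N):
    sigma = list(range(d))
    random.shuffle(sigma)
    order = sorted(range(n), key=lambda i: permute(dataset[i], sigma))
    sorted_orders.append((sigma, order))

# Query time: for each permutation, binary-search for q, then walk the up/down
# pointers by longest matching prefix, examining 2 candidates per permutation
# (2N in total); finally return the candidate closest to q in Hamming distance.
candidates = set()
for sigma, order in sorted_orders:
    pq = permute(q, sigma)
    keys = [permute(dataset[i], sigma) for i in order]
    pos = bisect_left(keys, pq)             # where q would land in this sorted order
    up, down = pos - 1, pos                 # pointer above and pointer below that spot
    for _ in range(2):
        take_up = up >= 0 and (down >= n or
                               prefix_len(keys[up], pq) >= prefix_len(keys[down], pq))
        if take_up:
            candidates.add(order[up])
            up -= 1
        elif down < n:
            candidates.add(order[down])
            down += 1

best = min(candidates, key=lambda i: hamming(dataset[i], q))
print("approximate nearest neighbour: v%d, Hamming distance %d" % (best, hamming(dataset[best], q)))

Note that the sorted orders are built once in the pre-processing loop; only the second loop runs per query.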


Edit:

Is it correct to say that the total cost of performing the N permutations is O(N n log n), since we have to sort each one of them?

If they actually sort every permutation from scratch, then yes, but it's not clear to me how they do it.

Is the permutation+sorting process described above performed only once during pre-processing, or for every query q?

ONCE.

At the last point, where we compare v0 and v4 to q, do we compare their permuted versions or the original ones (before the permutation)?

I think they do it with the permuted versions (see the parentheses before 2N in the paper). But that wouldn't make any difference, since they permute q too with the same permutation (σ).
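
Here is a tiny illustrative check (my own, not from the paper) that the Hamming distance does not change when the same permutation σ is applied to both vectors:

import random

random.seed(2)
d = 12
v = [random.randint(0, 1) for _ in range(d)]
q = [random.randint(0, 1) for _ in range(d)]

sigma = list(range(d))
random.shuffle(sigma)                      # one common permutation sigma

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

v_perm = [v[j] for j in sigma]
q_perm = [q[j] for j in sigma]

# sigma is a bijection of the positions, so each mismatching bit is counted
# exactly once either way.
assert hamming(v, q) == hamming(v_perm, q_perm)
print(hamming(v, q), hamming(v_perm, q_perm))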


This Quora answer may shed some light too.
