LSH中的非空桶 [英] Non-empty buckets in LSH

查看:147
本文介绍了LSH中的非空桶的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读有关 LSH 的调查,尤其是引用2.2.1部分的最后一段:

I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1:

为了提高召回率,构建了L个哈希表,并且将各项 位于L(L',L'< L)哈希桶h_1(q),··,h_L(q) 被作为q的附近项进行检索,以进行随机R-近邻搜索 (或随机c-近似R附近的邻居搜索). 为了保证精度,L个哈希码y_i中的每一个都需要 是一个长代码,这意味着存储桶总数为 太大而无法直接建立索引. 因此,只有非空桶是 通过使用哈希码h_1的对流哈希来保留 (x).

To improve the recall, L hash tables are constructed, and the items lying in the L (L ′ , L ′ < L) hash buckets h_1 (q), · · · , h_L (q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c- approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i , needs to be a long code, which means that the total number of the buckets is too large to index directly. Thus, only the nonempty buckets are retained by resorting to convectional hashing of the hash codes h_l (x).

我有3个问题:

  1. 粗体的句子对我来说还不清楚:重新排序为哈希码h_l (x)的常规哈希"是什么意思?
  2. 关于粗体语句,我不确定是否遇到了问题:我完全理解h_l(x)可能是一个很长的代码,因此可能的存储桶数量可能很大.例如,如果h_l(x)是二进制代码,而lengthh_l(x)的长​​度,那么我们总共有L*2^length个可能的存储桶(因为我们使用了L哈希表)...对吗?
  3. 最后一个问题:一旦找到查询向量q属于哪个存储桶,为了找到最近的邻居,我们必须使用原始向量q和原始距离度量?例如,假设原始向量q在128维q=[1,0,12,...,14.3]^T中,并且在我们的应用程序中使用了欧几里德距离.现在假设在LSH中使用的哈希函数(为简单起见,假设L = 1)将此向量映射到20维y=[0100...11]^T的二进制空间,以便确定将哪个存储桶分配q.因此y具有与存储区B相同的索引,并且已经包含100个向量.现在,为了找到最近的邻居,我们必须使用欧氏距离将q与所有其他100128维向量进行比较.这是正确的吗?
  1. The bold sentence is not clear to me: what does it mean by "resorting to convenctional hashing of the hash codes h_l (x)"?
  2. Always about the bold sentence, I'm not sure that I got the problem: I totally understand that h_l(x) can be a long code and so the number of possible buckets can be huge. For example, if h_l(x) is a binary code and length is h_l(x)'s length, then we have in total L*2^length possible buckets (since we use L hash tables)...is that correct?
  3. Last question: once we find which bucket the query vector q belongs to, in order to find the nearest neighbor we have to use the original vector q and the original distance metric? For example, let suppose that the original vector q is in 128 dimensions q=[1,0,12,...,14.3]^T and it uses the euclidean distance in our application. Now suppose that our hashing function (supposing that L=1 for simplicity) used in LSH maps this vector to a binary space in 20 dimensions y=[0100...11]^T in order to decide which bucket assign q to. So y has the same index of the bucket B, and which already contains 100 vectors. Now, in order to find the nearest neighbor, we have to compare q with all the others 100 128-dimensions vectors using euclidean distance. Is this correct?

推荐答案

他们用于改善召回率的方法构造了更多的哈希表,并且实质上为每个参考项存储了ID的多个副本,因此空间成本更大[4].如果有很多空桶,这增加了检索成本,则可以使用汉明空间中的双哈希方案或快速搜索算法来快速检索哈希桶.我认为在这种情况下,他们使用双哈希函数来检索非空存储桶.

Approach they are using to improve recall constructs more hash tables and essentially stores multiple copies of the ID for each reference item, hence space cost is larger [4]. If there are a lot of empty buckets which increases the retrieval cost, the double-hash scheme or fast search algorithm in the Hamming space can be used to fast retrieve the hash buckets. I think in this case they are using double hash function to retrieve non-empty buckets.

存储桶/内存单元数[1] [2] [3]-> O(nL)

No of buckets/memory cells [1][2][3] -> O(nL)

参考:

[1] http://simsearch.yury.name/russir/03nncourse-hand .pdf

[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf

[3] https://users.soe.ucsc.edu/〜niejiazhong/slides/kumar.pdf

[4] http://research.microsoft .com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf

这篇关于LSH中的非空桶的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆