LSH中的非空桶 [英] Non-empty buckets in LSH
问题描述
我正在阅读有关 LSH 的调查,尤其是引用2.2.1
部分的最后一段:
I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1
:
为了提高召回率,构建了L个哈希表,并且将各项 位于L(L',L'< L)哈希桶h_1(q),··,h_L(q) 被作为q的附近项进行检索,以进行随机R-近邻搜索 (或随机c-近似R附近的邻居搜索). 为了保证精度,L个哈希码y_i中的每一个都需要 是一个长代码,这意味着存储桶总数为 太大而无法直接建立索引. 因此,只有非空桶是 通过使用哈希码h_1的对流哈希来保留 (x).
To improve the recall, L hash tables are constructed, and the items lying in the L (L ′ , L ′ < L) hash buckets h_1 (q), · · · , h_L (q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c- approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i , needs to be a long code, which means that the total number of the buckets is too large to index directly. Thus, only the nonempty buckets are retained by resorting to convectional hashing of the hash codes h_l (x).
我有3个问题:
- 粗体的句子对我来说还不清楚:重新排序为哈希码
h_l (x)
的常规哈希"是什么意思? - 关于粗体语句,我不确定是否遇到了问题:我完全理解
h_l(x)
可能是一个很长的代码,因此可能的存储桶数量可能很大.例如,如果h_l(x)
是二进制代码,而length
是h_l(x)
的长度,那么我们总共有L*2^length
个可能的存储桶(因为我们使用了L
哈希表)...对吗? - 最后一个问题:一旦找到查询向量
q
属于哪个存储桶,为了找到最近的邻居,我们必须使用原始向量q
和原始距离度量?例如,假设原始向量q
在128维q=[1,0,12,...,14.3]^T
中,并且在我们的应用程序中使用了欧几里德距离.现在假设在LSH中使用的哈希函数(为简单起见,假设L = 1)将此向量映射到20维y=[0100...11]^T
的二进制空间,以便确定将哪个存储桶分配q
.因此y
具有与存储区B
相同的索引,并且已经包含100个向量.现在,为了找到最近的邻居,我们必须使用欧氏距离将q
与所有其他100128维向量进行比较.这是正确的吗?
- The bold sentence is not clear to me: what does it mean by "resorting to convenctional hashing of the hash codes
h_l (x)
"? - Always about the bold sentence, I'm not sure that I got the problem: I totally understand that
h_l(x)
can be a long code and so the number of possible buckets can be huge. For example, ifh_l(x)
is a binary code andlength
ish_l(x)
's length, then we have in totalL*2^length
possible buckets (since we useL
hash tables)...is that correct? - Last question: once we find which bucket the query vector
q
belongs to, in order to find the nearest neighbor we have to use the original vectorq
and the original distance metric? For example, let suppose that the original vectorq
is in 128 dimensionsq=[1,0,12,...,14.3]^T
and it uses the euclidean distance in our application. Now suppose that our hashing function (supposing that L=1 for simplicity) used in LSH maps this vector to a binary space in 20 dimensionsy=[0100...11]^T
in order to decide which bucket assignq
to. Soy
has the same index of the bucketB
, and which already contains 100 vectors. Now, in order to find the nearest neighbor, we have to compareq
with all the others 100 128-dimensions vectors using euclidean distance. Is this correct?
推荐答案
他们用于改善召回率的方法构造了更多的哈希表,并且实质上为每个参考项存储了ID的多个副本,因此空间成本更大[4].如果有很多空桶,这增加了检索成本,则可以使用汉明空间中的双哈希方案或快速搜索算法来快速检索哈希桶.我认为在这种情况下,他们使用双哈希函数来检索非空存储桶.
Approach they are using to improve recall constructs more hash tables and essentially stores multiple copies of the ID for each reference item, hence space cost is larger [4]. If there are a lot of empty buckets which increases the retrieval cost, the double-hash scheme or fast search algorithm in the Hamming space can be used to fast retrieve the hash buckets. I think in this case they are using double hash function to retrieve non-empty buckets.
存储桶/内存单元数[1] [2] [3]-> O(nL)
No of buckets/memory cells [1][2][3] -> O(nL)
参考:
[1] http://simsearch.yury.name/russir/03nncourse-hand .pdf
[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf
[3] https://users.soe.ucsc.edu/〜niejiazhong/slides/kumar.pdf
[4] http://research.microsoft .com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf
这篇关于LSH中的非空桶的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!