在数据库中进行汉明距离/相似度搜索 [英] Hamming Distance / Similarity searches in a database

查看：448 发布时间：2020/6/17 19:22:24 sql search nosql similarity hamming-distance

本文介绍了在数据库中进行汉明距离/相似度搜索的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个过程，类似于tineye生成感知哈希的过程，它们是32位整数.

I have a process, similar to tineye that generates perceptual hashes, these are 32bit ints.

我打算将来将它们存储在sql数据库(可能是nosql db)中

I intend to store these in a sql database (maybe a nosql db) in the future

但是，我很困惑如何根据哈希的相似性来检索记录.

However, I'm stumped at how I would be able to retrieve records based on the similarity of hashes.

有什么想法吗?

推荐答案

一种常见的方法(至少对我来说是常见的)是将您的哈希位字符串划分为多个块，并在这些块上进行查询以实现完全匹配.这是一个预过滤"步骤.然后，您可以对返回的结果执行按位汉明距离计算，该结果应该只是整个数据集的较小子集.这可以通过使用数据文件或SQL表来完成.

A common approach (at least common to me) is to divide your hash bit string in several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.

因此，简单来说:假设您在数据库中有一堆32位的哈希值，并且想要查找在查询"哈希值的4位汉明距离内的每个哈希值:

So in simple terms: Say you have a bunch of 32 bits hashes in a DB and that you want to find every hash that are within a 4 bits hamming distance of your "query" hash:

创建一个包含四列的表:每列将包含32位哈希(从1到4)的8位(作为字符串或整数)切片.

create a table with four columns: each will contain an 8 bits (as a string or int) slice of the 32 bits hashes, islice 1 to 4.

在qslice 1到4中以相同的方式对查询哈希进行切片.

slice your query hash the same way in qslice 1 to 4.

查询该表，以使qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4中的任何一个都有效.这将为您提供查询哈希值3位(4 - 1)之内的每个数据库哈希值.

query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that are within 3 bits (4 - 1) of the query hash.

对于每个返回的哈希，使用查询哈希成对计算精确的汉明距离(从四个切片重建索引端哈希)

for each returned hash, compute the exact hamming distance pair-wise with you query hash (reconstructing the index-side hash from the four slices)

第4步中的操作数应比整个表的完整成对汉明计算少得多.

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.

这种方法最早是由 Moses Charikar 在其"simhash"开创性的纸张和相应的Google

This approach was first described afaik by Moses Charikar in its "simhash" seminal paper and the corresponding Google patent:

在汉明空间中近似最近的邻居搜索

[...]

给出每个由d位组成的位向量，我们选择 N = O(n 1/(1+))个比特的随机排列.对于每个随机排列σ，我们维持O的排序顺序Oσ 位向量，按排列的位的字典顺序排列由σ.给定查询位向量q，我们找到近似值通过执行以下操作来最接近的邻居:

Given bit vectors consisting of d bits each, we choose N = O(n 1/(1+ ) ) random permutations of the bits. For each random permutation σ, we maintain a sorted order O σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

对于每个排列σ，我们对Oσ进行二进制搜索以找到两个最接近q的位向量(按由σ排列的位获得的字典顺序).现在，我们搜索每个上方和下方的排序顺序Oσ检查元素二进制搜索按以下顺序返回的位置与q匹配的最长前缀的长度.

For each permutation σ, we perform a binary search on O σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.

Monika Henziger 在她的论文:

3.3算法C的结果

3.3 The Results for Algorithm C

我们将每个页面的位字符串划分为12个非重叠4字节的片段，创建20B片段，并计算至少具有一个的所有页面的C相似度共同点.这种方法可以保证找到所有对差异最大为11的页面，即C相似度373，但可能会因较大差异而错过一些机会.

We partitioned the bit string of each page into 12 non- overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

检测几乎重复的网络爬网中也对此进行了解释. Gurmeet Singh Manku，Arvind Jain和Anish Das Sarma撰写:

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

起跑距离问题

定义:给定f位指纹和一个f位指纹的集合查询指纹F，识别是否存在指纹与F的最多k位不同. (在批处理模式版本中上面的问题，我们有一组查询指纹而不是单个查询指纹)

Definition: Given a collection of f -bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint)

[...]

直觉:考虑一个2 d f位真正随机指纹的排序表.只关注最高有效的d位在桌子上.这些d位数字的列表总计为几乎是一个计数器"，其含义是(a)相当多的2 d位- 存在组合，并且(b)很少的d位组合是重复的.另一方面，最低有效f-d 位几乎是随机的.

Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few 2 d bit- combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

现在选择d，使| d − d |是一个小整数.自从在对表格进行排序后，一个探针就足以识别出与d个最高有效位中的F相匹配的所有指纹. 由于| d − d |很小，此类匹配的数量也预计会很小.对于每个匹配的指纹，我们可以轻松找出在最多k个位上它是否与F不同否(这些差异自然会仅限于 f-最低有效位).

Now choose d such that |d − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions. Since |d − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d least-significant bit-positions).

上述过程可帮助我们找到现有的在k位位置上与F不同的指纹，所有这些被限制在的最低有效f-d位之间 F.这可以处理很多案件.覆盖所有在这种情况下，只需要建立少量的额外排序的表格，如在下一节中正式概述的那样.

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.

PS:大多数FWIW都在某种程度上或某些时间与Google关联.

PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.

这篇关于在数据库中进行汉明距离/相似度搜索的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在数据库中进行汉明距离/相似度搜索 [英] Hamming Distance / Similarity searches in a database

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在数据库中进行汉明距离/相似度搜索 [英] Hamming Distance / Similarity searches in a database

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭