在数据库中进行汉明距离/相似度搜索 [英] Hamming Distance / Similarity searches in a database

查看:448
本文介绍了在数据库中进行汉明距离/相似度搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个过程,类似于tineye生成感知哈希的过程,它们是32位整数.

I have a process, similar to tineye that generates perceptual hashes, these are 32bit ints.

我打算将来将它们存储在sql数据库(可能是nosql db)中

I intend to store these in a sql database (maybe a nosql db) in the future

但是,我很困惑如何根据哈希的相似性来检索记录.

However, I'm stumped at how I would be able to retrieve records based on the similarity of hashes.

有什么想法吗?

推荐答案

一种常见的方法(至少对我来说是常见的)是将您的哈希位字符串划分为多个块,并在这些块上进行查询以实现完全匹配.这是一个预过滤"步骤.然后,您可以对返回的结果执行按位汉明距离计算,该结果应该只是整个数据集的较小子集.这可以通过使用数据文件或SQL表来完成.

A common approach (at least common to me) is to divide your hash bit string in several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.

因此,简单来说:假设您在数据库中有一堆32位的哈希值,并且想要查找在查询"哈希值的4位汉明距离内的每个哈希值:

So in simple terms: Say you have a bunch of 32 bits hashes in a DB and that you want to find every hash that are within a 4 bits hamming distance of your "query" hash:

  1. 创建一个包含四列的表:每列将包含32位哈希(从1到4)的8位(作为字符串或整数)切片.

  1. create a table with four columns: each will contain an 8 bits (as a string or int) slice of the 32 bits hashes, islice 1 to 4.

在qslice 1到4中以相同的方式对查询哈希进行切片.

slice your query hash the same way in qslice 1 to 4.

查询该表,以使qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4中的任何一个都有效.这将为您提供查询哈希值3位(4 - 1)之内的每个数据库哈希值.

query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that are within 3 bits (4 - 1) of the query hash.

对于每个返回的哈希,使用查询哈希成对计算精确的汉明距离(从四个切片重建索引端哈希)

for each returned hash, compute the exact hamming distance pair-wise with you query hash (reconstructing the index-side hash from the four slices)

第4步中的操作数应比整个表的完整成对汉明计算少得多.

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.

这种方法最早是由 Moses Charikar 在其"simhash"开创性的纸张和相应的Google

This approach was first described afaik by Moses Charikar in its "simhash" seminal paper and the corresponding Google patent:

  1. 在汉明空间中近似最近的邻居搜索

[...]

给出每个由d位组成的位向量,我们选择 N = O(n 1/(1+))个比特的随机排列.对于每个 随机排列σ,我们维持O的排序顺序Oσ 位向量,按排列的位的字典顺序排列 由σ.给定查询位向量q,我们找到近似值 通过执行以下操作来最接近的邻居:

Given bit vectors consisting of d bits each, we choose N = O(n 1/(1+ ) ) random permutations of the bits. For each random permutation σ, we maintain a sorted order O σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

对于每个排列σ,我们对Oσ进行二进制搜索以找到 两个最接近q的位向量(按由σ排列的位获得的字典顺序).现在,我们搜索每个 上方和下方的排序顺序Oσ检查元素 二进制搜索按以下顺序返回的位置 与q匹配的最长前缀的长度.

For each permutation σ, we perform a binary search on O σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.

Monika Henziger 在她的论文:

3.3算法C的结果

3.3 The Results for Algorithm C

我们将每个页面的位字符串划分为12个非 重叠4字节的片段,创建20B片段,并计算至少具有一个的所有页面的C相似度 共同点.这种方法可以保证找到所有 对差异最大为11的页面,即C相似度373, 但可能会因较大差异而错过一些机会.

We partitioned the bit string of each page into 12 non- overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

检测几乎重复的网络爬网中也对此进行了解释. Gurmeet Singh Manku,Arvind Jain和Anish Das Sarma撰写:

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

  1. 起跑距离问题

定义:给定f位指纹和一个f位指纹的集合 查询指纹F,识别是否存在指纹 与F的最多k位不同. (在批处理模式版本中 上面的问题,我们有一组查询指纹 而不是单个查询指纹)

Definition: Given a collection of f -bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint)

[...]

直觉:考虑一个2 d f位真正随机指纹的排序表.只关注最高有效的d位 在桌子上.这些d位数字的列表总计为 几乎是一个计数器",其含义是(a)相当多的2 d位- 存在组合,并且(b)很少的d位组合是 重复的.另一方面,最低有效f-d 位几乎是随机的.

Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few 2 d bit- combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

现在选择d,使| d − d |是一个小整数.自从 在对表格进行排序后,一个探针就足以识别出与d个最高有效位中的F相匹配的所有指纹. 由于| d − d |很小,此类匹配的数量也 预计会很小.对于每个匹配的指纹,我们可以 轻松找出在最多k个位上它是否与F不同 否(这些差异自然会仅限于 f-最低有效位).

Now choose d such that |d − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions. Since |d − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d least-significant bit-positions).

上述过程可帮助我们找到现有的 在k位位置上与F不同的指纹,所有这些 被限制在的最低有效f-d位之间 F.这可以处理很多案件.覆盖所有 在这种情况下,只需要建立少量的额外 排序的表格,如在下一节中正式概述的那样.

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.

PS:大多数FWIW都在某种程度上或某些时间与Google关联.

PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.

这篇关于在数据库中进行汉明距离/相似度搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆