数据库中的汉明距离/相似性搜索 [英] Hamming Distance / Similarity searches in a database

查看:32
本文介绍了数据库中的汉明距离/相似性搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个过程,类似于 tineye 生成感知散列,这些是 32 位整数.

我打算将来将这些存储在 sql 数据库(可能是 nosql db)中

然而,我对如何根据哈希的相似性检索记录感到困惑.

有什么想法吗?

解决方案

一个常见的方法(至少对我来说是常见的)是将你的哈希位串分成几个块,然后查询这些块以获得精确匹配.这是一个预过滤"步骤.然后,您可以对返回的结果执行按位汉明距离计算,这应该只是整个数据集的一个较小子集.这可以使用数据文件或 SQL 表来完成.

简单来说:假设您在数据库中有一堆 32 位散列,并且您想找到在查询"散列的 4 位汉明距离内的每个散列:

  1. 创建一个包含四列的表:每列将包含 32 位散列的 8 位(作为字符串或整数)切片,即切片 1 到 4.

  2. 在 qslice 1 到 4 中以相同的方式对查询哈希进行切片.

  3. 查询这个表,使得 qslice1=islice1 或 qslice2=islice2 或 qslice3=islice3 或 qslice4=islice4 中的任何一个.这为您提供了查询哈希的 3 位 (4 - 1) 内的每个数据库哈希.

  4. 对于每个返回的散列,与查询散列成对计算精确的汉明距离(从四个切片重建索引端散列)

步骤 4 中的操作次数应远少于对整个表的完整成对汉明计算.

这种方法首先由 Moses Charikar 在其simhash"开创性的paper 和相应的 Google 专利:

<块引用>

  1. 汉明空间中的近似最近邻搜索

[...]

给定每个由 d 位组成的位向量,我们选择N = O(n 1/(1+ ) ) 位的随机排列.对于每个随机排列 σ,我们保持排序的顺序 O σ位向量,按位排列的字典顺序由 σ.给定一个查询位向量 q,我们找到近似值最近邻居通过执行以下操作:

对于每个排列 σ,我们对 O σ 执行二分搜索来定位最接近 q 的两个位向量(按照由 σ 置换的位获得的字典顺序).我们现在在每个排序顺序 O σ 检查上方和下方的元素二分查找返回的位置与 q 匹配的最长前缀的长度.

Monika Henziger 在她的论文 "查找近似重复的网页:算法的大规模评估":

<块引用>

3.3 算法 C 的结果

我们将每一页的位串划分为 12 个非重叠 4 字节片段,创建 20B 片段,并计算所有页面的 C 相似度,这些页面至少有一个共同点.这种方法保证找到所有差异高达 11 的页面对,即 C 相似度 373,但可能会因较大的差异而错过一些.

这也在论文检测网络抓取的近似重复中进行了解释作者:Gurmeet Singh Manku、Arvind Jain 和 Anish Das Sarma:

<块引用>

  1. 汉明距离问题

定义:给定一组 f 位指纹和一个查询指纹 F、识别是否存在指纹与 F 的区别最多为 k 位.(在批处理模式版本中上面的问题,我们有一组查询指纹而不是单个查询指纹)

[...]

直觉:考虑一个 2 d f 位真正随机指纹的排序表.只关注最重要的 d 位在表中.这些 d 位数字的列表相当于几乎是一个计数器",因为 (a) 相当多的 2 d 位-存在组合,并且 (b) 很少有 d 位组合是重复.另一方面,最不显着的 f − d位几乎是随机的".

现在选择 d 使得 |d − d|是一个小整数.自从表已排序,单个探针足以识别在 d 个最高有效位位置与 F 匹配的所有指纹.由于 |d − d|小,这种匹配的数量也是预计会很小.对于每个匹配的指纹,我们可以很容易弄清楚它是否在最多 k 位位置与 F 不同与否(这些差异自然会仅限于f - d 最低有效位位置).

上述过程帮助我们找到一个现有的在 k 位位置与 F 不同的指纹,所有这些被限制在最不重要的 f - d 位之间F. 这处理了相当多的案例.覆盖所有的情况下,只需要建立少量的额外排序表,在下一节中正式概述.

PS:FWIW,这些优秀的大脑中的大多数在某种程度上或某个时间都与谷歌有关.

I have a process, similar to tineye that generates perceptual hashes, these are 32bit ints.

I intend to store these in a sql database (maybe a nosql db) in the future

However, I'm stumped at how I would be able to retrieve records based on the similarity of hashes.

Any Ideas?

解决方案

A common approach (at least common to me) is to divide your hash bit string in several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.

So in simple terms: Say you have a bunch of 32 bits hashes in a DB and that you want to find every hash that are within a 4 bits hamming distance of your "query" hash:

  1. create a table with four columns: each will contain an 8 bits (as a string or int) slice of the 32 bits hashes, islice 1 to 4.

  2. slice your query hash the same way in qslice 1 to 4.

  3. query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that are within 3 bits (4 - 1) of the query hash.

  4. for each returned hash, compute the exact hamming distance pair-wise with you query hash (reconstructing the index-side hash from the four slices)

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.

This approach was first described afaik by Moses Charikar in its "simhash" seminal paper and the corresponding Google patent:

  1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...]

Given bit vectors consisting of d bits each, we choose N = O(n 1/(1+ ) ) random permutations of the bits. For each random permutation σ, we maintain a sorted order O σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.

Monika Henziger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non- overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

  1. THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f -bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint)

[...]

Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few 2 d bit- combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

Now choose d such that |d − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions. Since |d − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d least-significant bit-positions).

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.

PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.

这篇关于数据库中的汉明距离/相似性搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆