Efficiently find binary strings with low Hamming distance in large set



Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
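That flip-each-bit idea can be sketched directly: generate the 32 single-bit variants of the input and binary-search each one in the sorted list (a minimal sketch; `contains` and `query_dist1` are illustrative names, not part of the question):

```c
#include <stdint.h>
#include <stddef.h>

/* Binary search for `key` in a sorted array of uint32_t. */
static int contains(const uint32_t *arr, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)      lo = mid + 1;
        else if (arr[mid] > key) hi = mid;
        else                     return 1;
    }
    return 0;
}

/* Collect every member within Hamming distance 1 of `input` into `out`
   (the value itself plus each of the 32 single-bit flips);
   returns the number of matches, at most 33. */
static size_t query_dist1(const uint32_t *arr, size_t n,
                          uint32_t input, uint32_t *out)
{
    size_t hits = 0;
    if (contains(arr, n, input))
        out[hits++] = input;
    for (int bit = 0; bit < 32; bit++) {
        uint32_t cand = input ^ (UINT32_C(1) << bit);
        if (contains(arr, n, cand))
            out[hits++] = cand;
    }
    return hits;
}
```

This costs 33 binary searches per query; the scheme does not generalize well, since for distance d it needs C(32, d) probes.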

How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1.

Solution

Question: What do we know about the Hamming distance d(x,y)?

Answer:

  1. It is non-negative: d(x,y) ≥ 0
  2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
  3. It is symmetric: d(x,y) = d(y,x)
  4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
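The triangle inequality follows because x⊕z = (x⊕y)⊕(y⊕z) and popcount(a⊕b) ≤ popcount(a) + popcount(b). These properties are easy to sanity-check with a small popcount-based distance function (an illustrative check, not part of the answer's code):

```c
#include <stdint.h>

/* Hamming distance between two 32-bit words: count differing bits
   using Kernighan's trick (clear the lowest set bit each iteration). */
static int hamming(uint32_t x, uint32_t y)
{
    uint32_t v = x ^ y;
    int d = 0;
    while (v) {
        v &= v - 1;
        d++;
    }
    return d;
}
```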

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.

Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x^y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
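The "usual mask/shift/add code" referred to here is the classic SWAR population count; a typical 32-bit version looks like this (standard bit-twiddling, shown for reference):

```c
#include <stdint.h>

/* Branch-free SWAR population count for a 32-bit word. */
static inline int popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* 2-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums  */
    return (int)((v * 0x01010101u) >> 24);             /* add bytes   */
}

static inline int distance32(uint32_t x, uint32_t y)
{
    return popcount32(x ^ y);
}
```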

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
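A minimal BK tree over 32-bit integers might look like the following (an illustrative sketch, not the profiled implementation; error handling and freeing are omitted for brevity):

```c
#include <stdint.h>
#include <stdlib.h>

/* BK-tree node: child[d] holds the subtree of values at Hamming
   distance d from this node's value (d = 1..32). */
typedef struct BKNode {
    uint32_t value;
    struct BKNode *child[33];
} BKNode;

static int hamming(uint32_t x, uint32_t y)
{
    uint32_t v = x ^ y;
    int d = 0;
    while (v) { v &= v - 1; d++; }
    return d;
}

static BKNode *bk_new(uint32_t value)
{
    BKNode *n = calloc(1, sizeof *n);
    n->value = value;
    return n;
}

static void bk_insert(BKNode *root, uint32_t value)
{
    for (;;) {
        int d = hamming(root->value, value);
        if (d == 0) return;                       /* duplicate */
        if (!root->child[d]) {
            root->child[d] = bk_new(value);
            return;
        }
        root = root->child[d];                    /* descend the shell */
    }
}

/* Collect all values within `limit` of `query` into `out`; returns the
   hit count. By the triangle inequality, only the shells with
   d - limit <= i <= d + limit can contain matches. */
static size_t bk_query(const BKNode *node, uint32_t query, int limit,
                       uint32_t *out, size_t hits)
{
    int d = hamming(node->value, query);
    if (d <= limit)
        out[hits++] = node->value;
    int lo = d - limit < 1 ? 1 : d - limit;
    int hi = d + limit > 32 ? 32 : d + limit;
    for (int i = lo; i <= hi; i++)
        if (node->child[i])
            hits = bk_query(node->child[i], query, limit, out, hits);
    return hits;
}
```

The pruning step in `bk_query` is where the "shells" pay off: the larger the query distance, the fewer shells can be skipped, which is why coverage climbs toward 100% in the table below.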

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

  • Database: 100M pseudorandom integers
  • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
  • Results: Average # of query hits (very approximate)
  • Speed: Number of queries per second
  • Coverage: Average percentage of database examined per query

                -- BK Tree --   -- VP Tree --   -- Linear --
Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
1          0.90 3800     0.048% 4200     0.048%
2         11     300     0.68%   330     0.65%
3        130      56     3.8%     63     3.4%
4        970      18    12%       22    10%
5       5700       8.5  26%       10    22%
6       2.6e4      5.2  42%        6.0  37%
7       1.1e5      3.7  60%        4.1  54%
8       3.5e5      3.0  74%        3.2  70%
9       1.0e6      2.6  85%        2.7  82%
10      2.5e6      2.3  91%        2.4  90%
any                                             2.2     100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
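The flat-leaf tip amounts to replacing small subtrees with a popcount-filtered linear scan over a plain array, e.g. (a sketch; `leaf_scan` is an illustrative name):

```c
#include <stdint.h>
#include <stddef.h>

static int hamming(uint32_t x, uint32_t y)
{
    uint32_t v = x ^ y;
    int d = 0;
    while (v) { v &= v - 1; d++; }
    return d;
}

/* Scan a flat leaf array, keeping values within `limit` of `query`. */
static size_t leaf_scan(const uint32_t *leaf, size_t n,
                        uint32_t query, int limit, uint32_t *out)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (hamming(leaf[i], query) <= limit)
            out[hits++] = leaf[i];
    return hits;
}
```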

