Algorithm to test minimum Hamming distance against a set?


Question

I have a relatively straightforward thing I want to do:

  • Given a query number Q, a query distance d, and a set of numbers S, determine whether S contains any number whose Hamming distance from Q is less than or equal to d.

The simplest solution is to just make S a list and iterate over it, computing distances. If a distance less than or equal to d is computed, bail out and return TRUE.
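As a baseline, the linear scan described above can be sketched in Python as follows. The names `hamming` and `exists_within` are illustrative only, and `bin(...).count("1")` stands in for the hardware popcount instruction mentioned later in the question.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance of two non-negative integers: popcount of their XOR."""
    return bin(a ^ b).count("1")


def exists_within(q: int, d: int, s) -> bool:
    """Return True as soon as some member of s is within distance d of q."""
    return any(hamming(q, x) <= d for x in s)


# A toy set of 12-bit numbers with pairwise distance >= 4.
S = [0b000000000000, 0b000000001111, 0b000011110000]

print(exists_within(0b000000000111, 1, S))  # True: one bit away from 0b1111
print(exists_within(0b111111111111, 3, S))  # False: every member is >= 8 away
```

`any` stops at the first hit, which is the "bail out" behaviour; the worst case (no member within d) still touches every element.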

But considering that all I want to do is check for existence, something faster than a linear-time solution should be possible.

One thing I tried is an M-tree. Referencing some other questions on Stack Overflow, the Wikipedia article (https://en.wikipedia.org/wiki/M-tree) and two pre-existing implementations, I spent several hours yesterday implementing a custom solution. One of the nice things about this problem is that it's actually cheaper to compute popcount over the XOR of two numbers (using an SSE instruction) than to store numbers that would allow avoidance of computing the metric, so there are several aspects of the solution that could be simplified and optimized for speed.

The results were very disappointing. It turns out that the metric radius I'm dealing with is small compared to the minimum Hamming distance. For instance, in the space of 12-bit numbers, the maximum Hamming distance is 12. If the minimum I'm looking for is 4, that doesn't leave much opportunity for good non-overlapping partitioning. In fact, I tried just that, creating by brute force a set of 12-bit numbers with min Hamming distance of 4 and then (by brute force) finding an optimal binary tree partitioning so that a search algorithm could visit a minimum number of nodes. If I want to count the number of set elements within d of the query, I can't reduce the number of node visits below about 30% of the total, and stopping at the first match still visits about 4%. This means that I've more or less made a linear-time solution, where the overhead of the elaborate tree search algorithm is about the same as the savings from not having to check as many set members.

But what I want to do is very limited. I don't want to even count the number of set members with query distance <= d, much less enumerate them. I just want to check for existence. This makes me think about things like bloom filters and hashes.

I've also thought about trying to build a graph structure where set members are connected by edges with weights. Using the fact that Hamming distance respects triangle inequality, it seems to me there must be some way to search this graph such that edge traversals lead in a direction of likely smaller distance to the query, but I don't even really know where to start here.

Does anyone have any other suggestions for a solution here that might handily beat the performance of simply iterating an array?

EDIT and MOTIVATION:

Ultimately this comes from a coding theory question. For a given even number d and word size N, how many codes with min Hamming distance d can I fit into an N-bit number? This allows the creation of a code that can detect errors of d/2 bits and correct errors of up to d/2-1 bits. We know about Shannon-limit codes like LDPC, but this is for long codes with nebulous min Hamming distance, and they take forever to decode. There are also multi-bit-error codes like OLSC that are fast to decode, but they're far from space-efficient. On the other hand, for d=4, extended Hamming (SECDED) codes are optimally compact. I've seen BCH-based methods to make a DECTED code, but I don't know if they're optimal. To explore optimal encodings, what I wanted to do was generate alternative sets of codes of N bits with some arbitrary d and generate circuits to encode and decode them, selecting the most compact. I was also hoping to find some patterns that we might exploit for longer codes.

If this is (a) not already done, (b) feasible, and (c) someone would like to co-author a paper, please let me know. :)

Solution

I think the problem can be solved by splitting each number in S into substrings, such that any query result must have at least one partition whose Hamming distance from the corresponding partition of the query is no more than 1.

This algorithm is described in the article: Alex X. Liu, Ke Shen, Eric Torng. Large scale Hamming distance query processing, 2011. The authors call the algorithm HEngine. I will try to explain some of the intuition.

N - bit count of the number (its dimensionality; written m in the r-cut definition below)

k - query Hamming distance (d in the question)

r-cut(α) - function splitting the number α into r substrings {α1, α2, ..., αr}, where the first r − (m mod r) substrings have length ⌊m/r⌋ and the last m mod r substrings have length ⌈m/r⌉
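A minimal sketch of the r-cut split as defined above, operating on bit strings. The function name `r_cut` and the string representation are choices made for this illustration, not something taken from the paper.

```python
from typing import List


def r_cut(bits: str, r: int) -> List[str]:
    """Split a bit string of length m into r contiguous pieces.

    The first r - (m mod r) pieces have length floor(m/r); the last
    (m mod r) pieces have length ceil(m/r), i.e. one bit more.
    """
    short, n_long = divmod(len(bits), r)  # floor(m/r), m mod r
    widths = [short] * (r - n_long) + [short + 1] * n_long
    pieces, pos = [], 0
    for width in widths:
        pieces.append(bits[pos:pos + width])
        pos += width
    return pieces


print(r_cut("10001110", 2))  # ['1000', '1110']
print(r_cut("1000111", 3))   # ['10', '00', '111']
```

When m is divisible by r, every piece has the same length, because the floor and the ceiling coincide.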

The algorithm is based on the theorem:

For any two binary strings β and γ such that HD(β, γ) ≤ k, consider r-cut(β) and r-cut(γ) where r ≥ ⌊k/2⌋ + 1. It must be the case that HD(βi, γi) ≤ 1 for at least q = r − ⌊k/2⌋ different values of i.
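The theorem follows from a short counting argument, sketched here in LaTeX. The symbol b is introduced only for this sketch and denotes the number of positions i at which the pieces differ in two or more bits; each such position uses up at least two of the at most k differing bits.

```latex
\begin{align*}
\mathrm{HD}(\beta,\gamma) &= \sum_{i=1}^{r} \mathrm{HD}(\beta_i,\gamma_i) \le k, \\
2b &\le \sum_{i \,:\, \mathrm{HD}(\beta_i,\gamma_i) \ge 2} \mathrm{HD}(\beta_i,\gamma_i) \le k
  \quad\Longrightarrow\quad b \le \lfloor k/2 \rfloor, \\
\#\{\, i : \mathrm{HD}(\beta_i,\gamma_i) \le 1 \,\} &= r - b \;\ge\; r - \lfloor k/2 \rfloor = q .
\end{align*}
```

The condition r ≥ ⌊k/2⌋ + 1 is what guarantees q ≥ 1, so at least one pair of pieces is always within one bit of each other.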

For example, take binary strings of length N = 8 bits, and suppose we want to find strings within Hamming distance k = 2.

α = 10001110
β = 10100110
HD(α, β) = 2

Then the minimum value of r is ⌊2/2⌋ + 1 = 2. In this case r-cut(α) and r-cut(β) each produce 2 substrings of length 4 bits:

    α1 = 1000    α2 = 1110
    β1 = 1010    β2 = 0110
HD(α1, β1) = 1,  HD(α2, β2) = 1

q = 2 - ⌊2/2⌋ = 1, so at least one pair of corresponding substrings must differ in at most 1 bit. Here both pairs do.
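The worked example can be checked mechanically. The sketch below recomputes it and then brute-forces the theorem over all pairs of short strings; the helper names (`hamming_str`, `r_cut`, `close_segments`, `theorem_holds`) are invented for this illustration.

```python
from itertools import product


def hamming_str(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def r_cut(bits, r):
    """Split bits into r pieces: shorter pieces first, longer pieces last."""
    short, n_long = divmod(len(bits), r)
    widths = [short] * (r - n_long) + [short + 1] * n_long
    pieces, pos = [], 0
    for w in widths:
        pieces.append(bits[pos:pos + w])
        pos += w
    return pieces


def close_segments(a, b, r):
    """Count positions i where the i-th pieces differ in at most one bit."""
    return sum(hamming_str(x, y) <= 1 for x, y in zip(r_cut(a, r), r_cut(b, r)))


alpha, beta = "10001110", "10100110"
k = hamming_str(alpha, beta)
r = k // 2 + 1
q = r - k // 2
print(k, r, q, close_segments(alpha, beta, r))  # prints: 2 2 1 2


def theorem_holds(n_bits, k, r):
    """Exhaustively check the bound for every pair of n_bits-long strings."""
    words = ["".join(p) for p in product("01", repeat=n_bits)]
    return all(close_segments(a, b, r) >= r - k // 2
               for a in words for b in words if hamming_str(a, b) <= k)


print(theorem_holds(6, 3, 2))  # True
```

The exhaustive check is only practical for very small N; it is here to make the statement concrete, not to serve as a proof.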

The authors also introduce the following theorem (T is the data set being searched, i.e. S in this question):

Consider any string β ∈ T such that HD(α, β) ≤ k. Given any r ≥ ⌊k/2⌋ + 1, it follows that at least one β-signature matches its compatible α-signature.

The basic idea of the algorithm is to preprocess S to facilitate finding all strings β in S that satisfy the signature match property and then verify which of these strings actually are within Hamming distance k of α.

I suppose you should preprocess S into subtables using the HEngine algorithm, and split Q into partitions in the same way. Then perform the search partition by partition, using the fact that any match must have at least one partition within Hamming distance 1 of the corresponding partition of Q.
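To make the idea concrete, here is a simplified sketch of such a partition index in Python. It is not the HEngine implementation from the paper: it uses one plain hash table per partition instead of the paper's signature tables, and the names `SegmentIndex`, `segments`, and `exists_within` are invented for this example. For each partition of Q it probes the exact value plus every one-bit variation, then verifies candidates with the full Hamming distance.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance of two non-negative integers."""
    return bin(a ^ b).count("1")


def segments(n_bits, r):
    """(shift, width) pairs cutting n_bits into r pieces with r-cut widths."""
    short, n_long = divmod(n_bits, r)
    widths = [short] * (r - n_long) + [short + 1] * n_long
    out, shift = [], n_bits
    for w in widths:
        shift -= w
        out.append((shift, w))
    return out


class SegmentIndex:
    def __init__(self, members, n_bits, k):
        self.k = k
        self.segs = segments(n_bits, k // 2 + 1)  # r = floor(k/2) + 1
        self.tables = [defaultdict(list) for _ in self.segs]
        for x in members:
            for table, (shift, w) in zip(self.tables, self.segs):
                table[(x >> shift) & ((1 << w) - 1)].append(x)

    def exists_within(self, q):
        """True if some indexed member is within distance k of q."""
        for table, (shift, w) in zip(self.tables, self.segs):
            piece = (q >> shift) & ((1 << w) - 1)
            # Exact match plus every single-bit flip of this piece.
            probes = [piece] + [piece ^ (1 << b) for b in range(w)]
            for p in probes:
                for cand in table.get(p, ()):
                    if hamming(cand, q) <= self.k:  # verify the candidate
                        return True
        return False


S = [0b000000000000, 0b000000001111, 0b000011110000]
index = SegmentIndex(S, n_bits=12, k=2)
print(index.exists_within(0b000000000011))  # True: distance 2 from 0
print(index.exists_within(0b111111000000))  # False: nearest member is 6 away
```

By the theorem, any member within distance k of Q agrees with Q on some partition up to one bit, so it lands in one of the probed buckets and no match is missed. The index is built for a fixed k, and whether it beats a plain scan depends on how many candidates share a bucket, which this sketch does not measure.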

I advise you to consult the article for more details.
