找到最小的汉明距离任何字符串最快的方法? [英] Fastest way to find minimal Hamming distance to any substring?

查看:219
本文介绍了找到最小的汉明距离任何字符串最快的方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

由于长字符串和一个较短的字符串取值(约束是 .length必须> = 取值 .length),我想找到的取值长度等于取值 .length任何字符串。让我们调用函数这个 minHamming()。例如,

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,

minHamming(ABCDEFGHIJ,CDEFGG)== 1

minHamming(ABCDEFGHIJ,BCDGHI)== 3

这样做的明显的方式(列举升每个子)需要O(取值 .length * 。长)的时间。有没有什么聪明的办法做到这一点在次线性时间呢?我搜索了相同的有几个不同的取值字符串,所以做一些复杂的preprocessing到一次是可以接受的。

Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.

编辑:修改后的博耶,摩尔将是一个不错的主意,但我的字母表只有4个字母(DNA)

The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).

推荐答案

令人惊讶的是,这个确切的问题是可以解决的在短短O(| A | n日志n)的时间使用快速傅立叶变换(FFT ),其中n是较大序列的长度 | A | 是字母的大小

Perhaps surprisingly, this exact problem can be solved in just O(|A|nlog n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.

下面是一纸由唐纳德·本森一个免费的PDF描述它是如何工作的:

Here is a freely available PDF of a paper by Donald Benson describing how it works:

  • Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)

摘要:转换您的每一个字符串取值成若干指示器载体的(每个字符之一,所以在4 DNA的情况下),然后卷积的对应矢量来确定匹配计数为每个可能的调整。诀窍是在时间域,它通常需要为O(n ^ 2)时间卷积,可以在频率域,它仅需要O(n)的时间用乘法来实现,再加上所需的时间来转换域之间,然后再返回。使用每次转换只需要O(n日志n)时间的FFT,所以总的时间复杂度为O(| A | n日志N)。为了最大限度地提高速度,有限域的FFT的使用,其中只需要整数运算。

Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(nlog n) time, so the overall time complexity is O(|A|nlog n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.

注意:对于任意取值这个算法显然是一个巨大的性能险胜简单的O(MN)算法 | S | | L | 变大,但OTOH如果取值通常比登录较短| L | (例如查询大型数据库的小程序的时候),那么显然这个方法提供任何加速。

Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.

更新21/7/2009 :已更新一提的是时间复杂度也线性依赖于字母表的大小,因为在字母表中必须使用的每个字符单独对指标向量

UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.

这篇关于找到最小的汉明距离任何字符串最快的方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆