找到最相似的字符串输入最快的方法? [英] Fastest way to find most similar string to an input?
问题描述
由于长度为N的查询串Q,和长度的m序列的列表L确切的N,什么是最有效的算法来查找字符串L中以最少的不匹配位置至Q?例如:
Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:
Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q); # Returns "ABCCEFG"
answer2 = L.query("AAAATAA"); #Returns "AAAAAAA".
最显而易见的方法是扫描每一个序列中L,使搜索取O(M * N)。有没有办法做到这一点在次线性时间呢?我不在乎,如果有一个大量的前期成本,组织L放入一些数据结构,因为它会被询问了很多次。此外,擅自处理追平比分是好的。
The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.
编辑:为了澄清,我正在寻找的汉明距离
To clarify, I am looking for the Hamming distance.
推荐答案
地点敏感的哈希 underlies似乎是渐近最好的方法已知的,按照我的理解,从这个<一个href="http://mags.acm.org/communications/200801/?searchterm=nearest%2Bneighbor%2Bin%2Bhigh%2Bdimensions&pg=119"相对=nofollow>在中美洲共同市场的评论文章。说文章是pretty的毛毛,我没有读过这一切。另请参见近邻搜索。
Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.
要涉及这些引用您的问题:他们都处理的度量空间,一组点,如n维向量空间。在您的问题,n为每个字符串的长度,并在每个坐标的值是可以出现在字符串中的每个位置的字符。
To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.
这篇关于找到最相似的字符串输入最快的方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!