找到最相似的字符串输入最快的方法? [英] Fastest way to find most similar string to an input?

查看:330
本文介绍了找到最相似的字符串输入最快的方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

由于长度为N的查询串Q,和长度的m序列的列表L确切的N,什么是最有效的算法来查找字符串L中以最少的不匹配位置至Q?例如:

Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:

Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q);  # Returns "ABCCEFG"
answer2 = L.query("AAAATAA");  #Returns "AAAAAAA".

最显而易见的方法是扫描每一个序列中L,使搜索取O(M * N)。有没有办法做到这一点在次线性时间呢?我不在乎,如果有一个大量的前期成本,组织L放入一些数据结构,因为它会被询问了很多次。此外,擅自处理追平比分是好的。

The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.

编辑:为了澄清,我正在寻找的汉明距离

To clarify, I am looking for the Hamming distance.

推荐答案

地点敏感的哈希 underlies似乎是渐近最好的方法已知的,按照我的理解,从这个<一个href="http://mags.acm.org/communications/200801/?searchterm=nearest%2Bneighbor%2Bin%2Bhigh%2Bdimensions&pg=119"相对=nofollow>在中美洲共同市场的评论文章。说文章是pretty的毛毛,我没有读过这一切。另请参见近邻搜索

Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.

要涉及这些引用您的问题:他们都处理的度量空间,一组点,如n维向量空间。在您的问题,n为每个字符串的长度,并在每个坐标的值是可以出现在字符串中的每个位置的字符。

To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.

这篇关于找到最相似的字符串输入最快的方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆