Approximate string matching algorithms


Problem Description

Here at work, we often need to find the string, from a list of strings, that is the closest match to some other input string. Currently, we are using the Needleman-Wunsch algorithm. The algorithm often returns a lot of false positives (if we set the minimum score too low), sometimes it doesn't find a match when it should (when the minimum score is too high), and most of the time we need to check the results by hand. We thought we should try other alternatives.

Do you have any experience with these algorithms? Do you know how they compare to one another?

I'd really appreciate some advice.

PS: We're coding in C#, but that shouldn't matter - I'm asking about the algorithms in general.

Oh, I'm sorry I forgot to mention that.

No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it the search-list. We then need to process texts from various sources (RSS feeds, websites, forums, etc.): we extract parts of those texts (there is an entire set of rules for that, but that's irrelevant) and we need to match those against the search-list. If a string matches one of the strings in the search-list, we need to do some further processing (which is also irrelevant).

We cannot perform a normal comparison, because the strings extracted from the outside sources, most of the time, include some extra words, etc.

Anyway, it's not for duplicate detection.

Recommended Answer

OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often made it easier to favor high-quality internal matches. Smith-Waterman, as I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relative to a scoring metric for individual character pairs (in bioinformatics, typically given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices, and usually replaced with something simpler and more reasonably reflective of replacements in linguistic word morphology when applied to natural language).
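For concreteness, the global-alignment recurrence NW uses can be sketched in a few lines (Python here, since the question is language-agnostic; the +1/-1 scoring values are illustrative assumptions, not a recommended scheme):

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap      # a[:i] aligned against gaps
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap      # gaps aligned against b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[len(a)][len(b)]
```

Note how the initialization charges gap penalties all the way to the string ends; that end bias is exactly what align0 relaxes.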

Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance, and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases: the water gets deep quickly there, and thus we have heuristic methods of long and good repute.
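For reference, exact (unit-cost) edit distance is the classic quadratic dynamic program; a minimal sketch, keeping only two DP rows so memory stays linear:

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the O(len(a) * len(b)) dynamic program."""
    if len(a) < len(b):
        a, b = b, a                      # iterate over the longer string
    prev = list(range(len(b) + 1))       # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                       # deleting i characters of a
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

The quadratic cost per pair is precisely why, at scale, one reaches for the heuristic aligners above instead of computing this exactly for every candidate.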

Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against a reference sequence known to be correct, and I came up with something I called "anchored alignments".

The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
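A minimal sketch of that digest step (a plain dict of lists stands in for the perfect hash, which is an assumption made here for a larger natural-language alphabet):

```python
from collections import defaultdict

def build_anchor_index(search_list, n):
    """Map each length-n substring to every (string_id, start_offset)
    position where it occurs in the search list."""
    index = defaultdict(list)
    for sid, s in enumerate(search_list):
        for pos in range(len(s) - n + 1):
            index[s[pos:pos + n]].append((sid, pos))
    return index
```

Choosing N is the tuning knob: too small and every bin's list is long; too large and near-matches stop sharing any anchor at all.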

When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
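Putting the lookup and window extraction together (a sketch: the indexing is inlined so the snippet stands alone, and the function name `anchored_candidates` is mine, not from the answer):

```python
from collections import defaultdict

def anchored_candidates(query, search_list, n, k):
    """Hash the n-gram of `query` at offset k, look up its anchors, and
    return (string_id, window) pairs, where `window` is the substring of
    the search string with the same length as the query and the anchor
    at the same offset k. Each window is a candidate for NW alignment."""
    index = defaultdict(list)
    for sid, s in enumerate(search_list):
        for pos in range(len(s) - n + 1):
            index[s[pos:pos + n]].append((sid, pos))
    gram = query[k:k + n]
    candidates = []
    for sid, pos in index.get(gram, []):
        start = pos - k                  # place the anchor at offset k
        s = search_list[sid]
        if 0 <= start and start + len(query) <= len(s):
            candidates.append((sid, s[start:start + len(query)]))
    return candidates
```

Only the windows this returns ever reach the aligner; everything else is filtered out by the anchor lookup.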

If you choose a long enough anchor length N and a reasonable set of offset values K (they can be spread across the query string or restricted to low offsets), you should get a subset of possible alignments and will often get clearer winners. Typically you will want to use the less end-biased, align0-like NW aligner.

This method tries to boost NW a bit by restricting its input, and this yields a performance gain because you perform fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs, to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
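One common way to realize that give-up behavior is a banded variant that never lets the alignment drift more than `band` cells off the diagonal; this banding is my interpretation of the suggestion, not necessarily the answerer's exact mechanism:

```python
def banded_nw(a, b, band, match=1, mismatch=-1, gap=-1):
    """NW restricted to cells with |i - j| <= band. Cells outside the band
    are unreachable, so any alignment needing more than `band` net gaps is
    abandoned (score -inf), and the DP costs O(len(a) * band) instead of
    O(len(a) * len(b))."""
    NEG = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[NEG] * cols for _ in range(rows)]
    dp[0][0] = 0
    for i in range(rows):
        lo, hi = max(0, i - band), min(cols - 1, i + band)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:          # match/mismatch step
                best = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if i > 0 and dp[i - 1][j] > NEG:
                best = max(best, dp[i - 1][j] + gap)   # gap in b
            if j > 0 and dp[i][j - 1] > NEG:
                best = max(best, dp[i][j - 1] + gap)   # gap in a
            dp[i][j] = best
    return dp[len(a)][len(b)]
```

A score of `-inf` is the "give up" signal: the candidate needed more gapping than you were willing to pay for.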

Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.

Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering data fed to NW. If you do this, be prepared to possibly send overlapping strings each containing one of the two anchors to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment using penalty modification during the algorithm's execution.

Hope this is helpful or at least interesting.
