“模糊匹配"算法字符串 [英] Algorithms for "fuzzy matching" strings
问题描述
通过模糊匹配,我不是指 Levenshtein distance 或类似的类似字符串,而是指它在 TextMate/Ido/Icicles 中的使用方式:给定一个字符串列表,找到那些包含搜索字符串中所有字符的字符串,但是可能与其他字符之间,更喜欢最合适的.
By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.
推荐答案
我终于明白你在找什么了.这个问题很有趣,但是查看您发现的 2 种算法似乎人们对规范有很大不同的看法 ;)
I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)
我认为更清楚地说明问题和要求会很有用.
I think it would be useful to state the problem and the requirements more clearly.
问题:
我们正在寻找一种方法来加快打字速度,允许用户只输入他们实际想要的关键字的几个字母,并向他们推荐一个列表供他们选择.
We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.
- 期望输入的所有字母都在关键字中
- 期望输入中的字母在关键字中的顺序相同
- 返回的关键字列表应以一致(可重复)的顺序呈现
- 算法应该不区分大小写
分析:
前两个要求可以总结如下:对于输入 axg
,我们正在寻找与此正则表达式匹配的词 [^a]*a[^x]*x[^g]*g.*
The first two requirements can be sum up like such: for an input axg
we are looking for words matching this regular expression [^a]*a[^x]*x[^g]*g.*
第三个要求是故意宽松的.单词出现在列表中的顺序需要保持一致……但是很难猜测评分方法是否比字母顺序更好.如果列表非常长,那么评分方法可能会更好,但是对于短列表,眼睛更容易在以明显方式排序的列表中查找特定项目.
The third requirement is purposely loose. The order in which the words should appear in the list need being consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremy long, then a scoring approach could be better, however for short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.
此外,字母顺序在打字时具有一致性的优点:即添加一个字母并不会完全重新排序列表(对眼睛和大脑都很痛苦),它只会过滤掉不再匹配的项目.
Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.
处理 unicode 字符没有精确度,例如 à
类似于 a
还是其他字符?由于我不知道目前在关键字中使用此类字符的任何语言,我暂时先不说.
There is no precision about handling unicode characters, for example is à
similar to a
or another character altogether ? Since I know of no language that currently uses such characters in their keywords, I'll let it slip for now.
我的解决方案:
对于任何输入,我都会构建之前表达的正则表达式.它适用于 Python,因为该语言已经具有不区分大小写的匹配功能.
For any input, I would build the regular expression expressed earlier. It suitable for Python because the language already features case-insensitive matching.
然后我会匹配我的(按字母顺序排序的)关键字列表,然后过滤输出.
I would then match my (alphabetically sorted) list of keywords, and output it so filtered.
在伪代码中:
WORDS = ['Bar', 'Foo', 'FooBar', 'Other']
def GetList(input, words = WORDS):
expr = ['[^' + i + ']*' + i for i in input]
return [w for w in words if re.match(expr, w, re.IGNORECASE)]
我本来可以使用单行的,但我认为它会使代码变得模糊;)
I could have used a one-liner but thought it would obscure the code ;)
此解决方案适用于增量情况(即,当您匹配用户类型并因此不断重建时),因为当用户添加字符时,您可以简单地重新过滤刚刚计算的结果.因此:
This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:
- 要么是字符少,要么匹配快,列表长度无所谓
- 要么有很多字符,这意味着我们正在过滤一个短列表,因此匹配需要更长的元素并没有太大关系
我还应该注意到,这个正则表达式不涉及回溯,因此非常有效.它也可以建模为一个简单的状态机.
I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.
这篇关于“模糊匹配"算法字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!