“模糊匹配"算法字符串 [英] Algorithms for "fuzzy matching" strings

查看:40
本文介绍了“模糊匹配"算法字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

通过模糊匹配,我不是指 Levenshtein distance 或类似的类似字符串,而是指它在 TextMate/Ido/Icicles 中的使用方式:给定一个字符串列表,找到那些包含搜索字符串中所有字符的字符串,但是可能与其他字符之间,更喜欢最合适的.

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.

推荐答案

我终于明白你在找什么了.这个问题很有趣,但是查看您发现的 2 种算法似乎人们对规范有很大不同的看法 ;)

I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)

我认为更清楚地说明问题和要求会很有用.

I think it would be useful to state the problem and the requirements more clearly.

问题:

我们正在寻找一种方法来加快打字速度,允许用户只输入他们实际想要的关键字的几个字母,并向他们推荐一个列表供他们选择.

We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.

  1. 期望输入的所有字母都在关键字中
  2. 期望输入中的字母在关键字中的顺序相同
  3. 返回的关键字列表应以一致(可重复)的顺序呈现
  4. 算法应该不区分大小写

分析:

前两个要求可以总结如下:对于输入 axg,我们正在寻找与此正则表达式匹配的词 [^a]*a[^x]*x[^g]*g.*

The first two requirements can be sum up like such: for an input axg we are looking for words matching this regular expression [^a]*a[^x]*x[^g]*g.*

第三个要求是故意宽松的.单词出现在列表中的顺序需要保持一致……但是很难猜测评分方法是否比字母顺序更好.如果列表非常长,那么评分方法可能会更好,但是对于短列表,眼睛更容易在以明显方式排序的列表中查找特定项目.

The third requirement is purposely loose. The order in which the words should appear in the list need being consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremy long, then a scoring approach could be better, however for short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.

此外,字母顺序在打字时具有一致性的优点:即添加一个字母并不会完全重新排序列表(对眼睛和大脑都很痛苦),它只会过滤掉不再匹配的项目.

Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.

处理 unicode 字符没有精确度,例如 à 类似于 a 还是其他字符?由于我不知道目前在关键字中使用此类字符的任何语言,我暂时先不说.

There is no precision about handling unicode characters, for example is à similar to a or another character altogether ? Since I know of no language that currently uses such characters in their keywords, I'll let it slip for now.

我的解决方案:

对于任何输入,我都会构建之前表达的正则表达式.它适用于 Python,因为该语言已经具有不区分大小写的匹配功能.

For any input, I would build the regular expression expressed earlier. It suitable for Python because the language already features case-insensitive matching.

然后我会匹配我的(按字母顺序排序的)关键字列表,然后过滤输出.

I would then match my (alphabetically sorted) list of keywords, and output it so filtered.

在伪代码中:

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words = WORDS):
  expr = ['[^' + i + ']*' + i for i in input]
  return [w for w in words if re.match(expr, w, re.IGNORECASE)]

我本来可以使用单行的,但我认为它会使代码变得模糊;)

I could have used a one-liner but thought it would obscure the code ;)

此解决方案适用于增量情况(即,当您匹配用户类型并因此不断重建时),因为当用户添加字符时,您可以简单地重新过滤刚刚计算的结果.因此:

This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:

  • 要么是字符少,要么匹配快,列表长度无所谓
  • 要么有很多字符,这意味着我们正在过滤一个短列表,因此匹配需要更长的元素并没有太大关系

我还应该注意到,这个正则表达式不涉及回溯,因此非常有效.它也可以建模为一个简单的状态机.

I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.

这篇关于“模糊匹配"算法字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆