“模糊匹配"算法字符串 [英] Algorithms for "fuzzy matching" strings

查看：40 发布时间：2021/12/20 14:08:49 algorithm search string fuzzy-search

本文介绍了“模糊匹配"算法字符串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

通过模糊匹配，我不是指 Levenshtein distance 或类似的类似字符串，而是指它在 TextMate/Ido/Icicles 中的使用方式:给定一个字符串列表，找到那些包含搜索字符串中所有字符的字符串，但是可能与其他字符之间，更喜欢最合适的.

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.

推荐答案

我终于明白你在找什么了.这个问题很有趣，但是查看您发现的 2 种算法似乎人们对规范有很大不同的看法 ;)

I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)

我认为更清楚地说明问题和要求会很有用.

I think it would be useful to state the problem and the requirements more clearly.

问题:

我们正在寻找一种方法来加快打字速度，允许用户只输入他们实际想要的关键字的几个字母，并向他们推荐一个列表供他们选择.

We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.

期望输入的所有字母都在关键字中
期望输入中的字母在关键字中的顺序相同
返回的关键字列表应以一致(可重复)的顺序呈现
算法应该不区分大小写

分析:

前两个要求可以总结如下:对于输入 axg，我们正在寻找与此正则表达式匹配的词 [^a]*a[^x]*x[^g]*g.*

The first two requirements can be sum up like such: for an input axg we are looking for words matching this regular expression [^a]*a[^x]*x[^g]*g.*

第三个要求是故意宽松的.单词出现在列表中的顺序需要保持一致……但是很难猜测评分方法是否比字母顺序更好.如果列表非常长，那么评分方法可能会更好，但是对于短列表，眼睛更容易在以明显方式排序的列表中查找特定项目.

The third requirement is purposely loose. The order in which the words should appear in the list need being consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremy long, then a scoring approach could be better, however for short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.

此外，字母顺序在打字时具有一致性的优点:即添加一个字母并不会完全重新排序列表(对眼睛和大脑都很痛苦)，它只会过滤掉不再匹配的项目.

Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.

处理 unicode 字符没有精确度，例如 à 类似于 a 还是其他字符?由于我不知道目前在关键字中使用此类字符的任何语言，我暂时先不说.

There is no precision about handling unicode characters, for example is à similar to a or another character altogether ? Since I know of no language that currently uses such characters in their keywords, I'll let it slip for now.

我的解决方案:

对于任何输入，我都会构建之前表达的正则表达式.它适用于 Python，因为该语言已经具有不区分大小写的匹配功能.

For any input, I would build the regular expression expressed earlier. It suitable for Python because the language already features case-insensitive matching.

然后我会匹配我的(按字母顺序排序的)关键字列表，然后过滤输出.

I would then match my (alphabetically sorted) list of keywords, and output it so filtered.

在伪代码中:

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words = WORDS):
  expr = ['[^' + i + ']*' + i for i in input]
  return [w for w in words if re.match(expr, w, re.IGNORECASE)]

我本来可以使用单行的，但我认为它会使代码变得模糊；)

I could have used a one-liner but thought it would obscure the code ;)

此解决方案适用于增量情况(即，当您匹配用户类型并因此不断重建时)，因为当用户添加字符时，您可以简单地重新过滤刚刚计算的结果.因此:

This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:

要么是字符少，要么匹配快，列表长度无所谓
要么有很多字符，这意味着我们正在过滤一个短列表，因此匹配需要更长的元素并没有太大关系

我还应该注意到，这个正则表达式不涉及回溯，因此非常有效.它也可以建模为一个简单的状态机.

I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.

这篇关于“模糊匹配"算法字符串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

“模糊匹配"算法字符串 [英] Algorithms for "fuzzy matching" strings

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

“模糊匹配"算法字符串 [英] Algorithms for &quot;fuzzy matching&quot; strings

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

“模糊匹配"算法字符串 [英] Algorithms for "fuzzy matching" strings

登录关闭