算法"模糊匹配"字符串 [英] Algorithms for "fuzzy matching" strings

查看:168
本文介绍了算法"模糊匹配"字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

通过模糊匹配我的意思不是类似字符串由Levenshtein距离或类似的东西,但它在TextMate中/伊多文/冰柱使用方法:给定一个字符串列表,找到那些包括在搜索字符串中的所有字符,但可能与之间,preferring最合适的。

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.

推荐答案

我终于明白了什么,你所期待的。这个问题很有趣但是在看2算法,你发现它似乎人们对规范广泛不同意见;)

I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)

我认为这将是有益的陈述问题和要求更清楚。

I think it would be useful to state the problem and the requirements more clearly.

问题

我们正在寻找一种方式,允许用户只输入关键字他们居然打算的几个字母,并提出他们从中选择一个列表,加快打字。

We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.

  1. 预期的输入的所有字母是在关键字
  2. 据预计,在输入的字母在同一个顺序在关键字
  3. 关键字的列表中返回应psented一致(reproductible)为了$ P $
  4. 在该算法应不区分大小写

分析

前两个要求可以总结一下像这样:对于输入 AXG 我们正在寻找的话匹配此正EX pression [^ A] * A [^ X] * X [^ G] * G *

The first two requirements can be sum up like such: for an input axg we are looking for words matching this regular expression [^a]*a[^x]*x[^g]*g.*

第三个要求是故意松。其中是一致的话应该出现在列表中需要的命令......但它很难猜测评分方法是否会比按字母顺序排列好。如果列表extremy长,则评分方法可能会更好,但对于短名单更容易让眼睛去寻找下一个列表排序明显地特定项目。

The third requirement is purposely loose. The order in which the words should appear in the list need being consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremy long, then a scoring approach could be better, however for short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.

此外,按字母顺序排列具有一致性的优势打字时:即增加一个字母不完全列表重新排序(痛苦的眼睛和大脑),它只是过滤掉不符合任何更长的项目

Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.

有一个关于处理UNI code字符,例如没有precision是 A 类似于 A 或其它字符完全?因为我知道,目前使用这些字符在他们的关键字没有语言的,我会让它溜走现在。

There is no precision about handling unicode characters, for example is à similar to a or another character altogether ? Since I know of no language that currently uses such characters in their keywords, I'll let it slip for now.

我的解决办法:

对于任何投入,我将建立早期pssed正规前pression EX $ P $。它适用于Python的,因为语言已经拥有不区分大小写匹配。

For any input, I would build the regular expression expressed earlier. It suitable for Python because the language already features case-insensitive matching.

我会再匹配关键字的我(按字母顺序排序)列表,并输出使过滤。

I would then match my (alphabetically sorted) list of keywords, and output it so filtered.

在伪code:

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words = WORDS):
  expr = ['[^' + i + ']*' + i for i in input]
  return [w for w in words if re.match(expr, w, re.IGNORECASE)]

我可以用一个一行但认为它会掩盖了code;)

I could have used a one-liner but thought it would obscure the code ;)

该解决方案非常适用于增量的情况下(例如,当您匹配的用户类型,从而保持重建),因为当用户添加,你可以简单地重新过滤结果你只计算一个字符。因此:

This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:

  • 无论有几个字符,从而匹配的是快速和列表的长度没有多大关系
  • 要么也有许多字符,这意味着我们正在筛选候选名单,因此它没有太大的关系,如果匹配需要更长的时间逐元素

我也应该注意到,这种常规的前pression不涉及回溯,因此十分有效。它也可以被建模为一个简单的状态机。

I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.

这篇关于算法"模糊匹配"字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆