Fuzzy Regular Expressions

Question

In my work I have had great results using approximate string matching algorithms, such as Damerau–Levenshtein distance, to make my code less vulnerable to spelling mistakes.

Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2 (one missing "V" and one extra ".").

This could be done by generating every string the regex accepts (in this case 100 x 12 = 1,200) and finding the best match, but that doesn't seem practical.
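
For reference, the brute-force idea can be sketched in a few lines of Python. This is only an illustration, not a recommended solution: it assumes the elided alternation stands for the twelve three-letter English month abbreviations, and it uses plain Levenshtein distance (insertion, deletion, substitution) rather than Damerau–Levenshtein. The names levenshtein, candidates and best_distance are made up for this example.

```python
from itertools import product

# Assumption: the "..." in the pattern stands for the remaining month abbreviations.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute; no transposition)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def candidates():
    """Yield every string accepted by TV Schedule for \\d\\d (<month>)."""
    for d1, d2, month in product("0123456789", "0123456789", MONTHS):
        yield f"TV Schedule for {d1}{d2} {month}"


def best_distance(text):
    """Smallest edit distance between text and any accepted string."""
    return min(levenshtein(text, c) for c in candidates())


print(best_distance("TV Schedule for 10 Jan"))   # 0
print(best_distance("T Schedule for 10. Jan"))   # 2
```

Every lookup runs 1,200 full distance computations, and the candidate count multiplies with each extra digit or alternation in the pattern, which is why this approach does not scale.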

Do you have any ideas on how to do this efficiently?

Recommended answer

I found the TRE library, which seems to do exactly this kind of fuzzy matching against regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution, though, not transposition. But I guess that works OK.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got the following (with -s, agrep prefixes each line with its match cost):

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March
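
To give a feel for why enumeration is not needed, here is a minimal Python sketch that computes the same kind of cost (insertion, deletion and substitution only, no transposition) directly against the pattern from the question. It is not how TRE is implemented and does not use TRE's API. It is a hand-rolled dynamic program for this one pattern, the helper names step and fuzzy_cost are invented for the example, and the month list is an assumption about what the "..." in the question stands for.

```python
import string

# Assumption: the elided alternation covers all twelve month abbreviations.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def step(row, allowed, text):
    """Advance the DP row by one pattern position accepting any char in allowed.

    row[j] holds the cheapest cost of aligning text[:j] with the pattern
    consumed so far.
    """
    new = [row[0] + 1]  # pattern position with no text left to pair it with
    for j, ch in enumerate(text, 1):
        new.append(min(
            row[j] + 1,                        # pattern char missing from text
            new[j - 1] + 1,                    # extra char in text
            row[j - 1] + (ch not in allowed),  # match or substitution
        ))
    return new


def fuzzy_cost(text):
    """Edit cost of text against ^TV Schedule for \\d\\d (<month>)$."""
    row = list(range(len(text) + 1))
    for ch in "TV Schedule for ":
        row = step(row, {ch}, text)
    for _ in range(2):                         # \d\d
        row = step(row, set(string.digits), text)
    row = step(row, {" "}, text)
    # Alternation: run each branch from the same row, then keep the
    # column-wise minimum over all branches.
    branches = []
    for month in MONTHS:
        branch = row
        for ch in month:
            branch = step(branch, {ch}, text)
        branches.append(branch)
    row = [min(col) for col in zip(*branches)]
    return row[-1]


print(fuzzy_cost("TV Schedule for 10 Jan"))   # 0
print(fuzzy_cost("T Schedule for 10. Jan"))   # 2
```

The two digit positions are handled as character classes, so the work grows with the length of the pattern (19 shared positions plus 12 three-letter branches) rather than with the 1,200 strings it accepts.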

Thanks a lot for all your suggestions.
