文本信息与数千个正则表达式的高效匹配 [英] Efficient matching of text messages against thousands of regular expressions

查看:127
本文介绍了文本信息与数千个正则表达式的高效匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在解决一个问题,我有一条短信可以与数千种形式的正则表达式匹配

I am solving a problem where I have text message to match with thousands of regular expressions of the form

<some string> {0 or 300 chars} <some string> {0 or 300 chars}

例如

"on"[ \t\r]*(.){0,300}"."[ \t\r]*(.){0,300}"from"

或者一个真实的例子可以是

or a real example can be

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*(.){0,80}[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"(.){0,80}[ \t\r]*"Courier,"[ \t\r]*(.){0,80}[ \t\r]*"on"(.){0,80}"."[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

首先,我使用Java的正则表达式引擎.我一次将输入字符串与一个正则表达式匹配.这个过程太慢了.我发现Java的正则表达式引擎将正则表达式编译为NFA(非确定性有限自动机),由于灾难性的回溯,NFA会变慢.因此,我想到了使用flex-lexer将正则表达式转换为DFA(确定性有限自动机)的方法,以便将数百个正则表达式编译为一个DFA,因此我将在O(n)中获得匹配结果,n是输入字符串的长度.但是由于正则表达式中固定的重复计数,flex会花费大量时间来编译请参阅这里.

To start with I used Java's regular expression engine. I matched input string with one regular expression at a time. This process was too slow. I found that Java's regex engine compiles regular expression into NFA (Non Deterministic Finite Automata) which can get slow due to catastrophic backtracking. Therefore I thought of converting regular expressions into DFA (Deterministic Finite Automata) using flex-lexer so as to compile hundreds of regex into one DFA and thus I would get match result in O(n), n is the length of input string. But due to fixed repetition count in regex, flex is taking forever to compile see here.

可能是我做错了.有没有更好的方法可以做到这一点?我能想到的一种方法是将固定重复次数转换为不确定重复(星号运算符),如下所示

May be I am doing it all wrong. Is there any better way to do this? One way that I could think of is to convert fixed repetition count to indefinite repetition (the star operator) as follows

"on"[ \t\r]*(.)*"."[ \t\r]*(.)*"from"

此正则表达式可以毫无问题地进行编译,并且只需要几毫秒的时间.如果输入字符串与此规则匹配,我知道输入字符串中存在规则("on", "." and "from")中的常量字符串.现在iff flex支持命名的正则表达式组,我可以简单地计算这些组中的字符数并进行验证,但flex并非用于此目的.

This regex compiles without problem and it takes only milliseconds. If input string matches with this rule I know that constant strings from rule ("on", "." and "from") are present in input string. Now iff flex supported named group of regular expressions, I could simply count number of characters in those groups and verify but flex is not meant for this purpose.

问题-有什么方法可以有效解决此问题?

Question -- Is there any way to solve this problem efficiently?

推荐答案

问题是正则表达式的每个其他部分都是(.){0,80}:

The problem is every other part of the regex is (.){0,80}:

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*
(.){0,80}
[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"
(.){0,80}
[ \t\r]*"Courier,"[ \t\r]*
(.){0,80}
[ \t\r]*"on"
(.){0,80}"."
[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

当下一个单词与最后一个单词不完全相同的80个字符时,正则表达式会变慢.它需要回溯以查看79是否会工作.或78.或77 ...不是全部或全部,(就像您似乎相信的那样; .{80}?为80或0个字符).

The regex is slow when the next word doesn't appear exactly 80 characters after the last one. It needs to backtrack to see if 79 would work. Or 78. Or 77... It's not all or nothing, (as you seem to believe it is; 80 or 0 characters would be .{80}?).

引擎更优化以应对.*,因为它更常见

The engine is simply more optimized to deal with .* because it is more common

根据事物在字符串中的位置,您可能可以通过懒惰的.{0,80}?获得更好的性能.但这不是一个很好的解决方案.

Depending on where things are in the string, you may be able to get better performance with the lazy .{0,80}?. But this isn't a great solution.

我认为这里的答案是使用多个正则表达式.

I think the answer here is to use more than one regex.

您可以找到匹配发生的索引,然后进行比较以比较它是否在此之前或之后之前的词组最先匹配的地方.

You can find the index the match happened at, and then compare compare it to see if it came before or after where the phrase before first matched.

事情变得更加复杂,可以在多个区域进行匹配,并且每次匹配的字符数不能超过x个.在这种情况下,您只需要收集多个匹配项并稍微改变一下数学即可.

It gets more complicated things can match in multiple areas AND you need for each match to be no more than x characters away. In that case, you would just need to collect multiple matches and change the math a little.

这篇关于文本信息与数千个正则表达式的高效匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆