比较不匹配正则表达式的速度 [英] Comparing speed of non-matching regexp

查看:43
本文介绍了比较不匹配正则表达式的速度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下 Python 代码非常慢:

The following Python code is incredibly slow:

import re
re.match( '([a]+)+c', 'a' * 30 + 'b' )

如果用更大的常数替换 30,情况会变得更糟.

and it gets worse if you replace 30 with a larger constant.

我怀疑由于连续的 + 导致的解析歧义是罪魁祸首,但我在正则表达式解析和匹配方面不是很专业.这是 Python 正则表达式引擎的错误,还是任何合理的实现都会这样做?

I suspect that the parsing ambiguity due to the consecutive + is the culprit, but I'm not very expert in regexp parsing and matching. Is this a bug of the Python regexp engine, or any reasonable implementation will do the same?

我不是 Perl 专家,但以下返回速度相当快

I'm not a Perl expert, but the following returns quite fast

perl -e '$s="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; print "ok\n" if $s =~ m/([a]+)+c/;'

并且增加 'a' 的数量不会显着改变执行速度.

and increasing the number of 'a' does not alter substantially the execution speed.

推荐答案

我认为 Perl 足够聪明,可以将两个 + 合并为一个,而 Python 则不然.现在让我们想象一下引擎会做什么,如果这没有被优化掉.请记住,捕获通常很昂贵.还要注意,两个 + 都是贪婪的,因此引擎将尝试在一个回溯步骤中使用尽可能多的重复.每个项目符号代表一个回溯步骤:

I assume that Perl is clever enough to collapse the two +s into one, while Python is not. Now let's imagine what the engine does, if this is not optimized away. And remember that capturing is generally expensive. Note also, that both +s are greedy, so the engine will try to use as many repetitions as possible in one backtracking step. Each bullet point represents one backtracking step:

  • 引擎使用尽可能多的[a],并消耗所有三十个a.然后它不能再进一步了,所以它留下了第一次重复并捕获 30 a s.现在下一次重复开始,它试图用另一个 ([a]+) 消耗更多,但这当然不起作用.然后 c 无法匹配 b.
  • 回溯!丢弃内部重复消耗的最后一个 a .在此之后,我们再次离开内部重复,因此引擎将捕获 29 a s.然后另一个 + 开始工作,再次尝试内部重复(消耗第 30 个 a).然后我们再次离开内部重复,这也离开了捕获组,所以第一个捕获被丢弃,引擎捕获最后一个a.c 无法匹配 b.
  • 回溯!扔掉另一个 a 里面.我们捕获 28 个a.捕获组的第二个(外部重复)消耗最后 2 个 a捕获.c 无法匹配 b.
  • 回溯!现在我们可以在第二个重复中回溯并丢弃两个 a 中的第二个.剩下的将被捕获.然后我们第三次进入捕获组,捕获最后一个a.c 无法匹配 b.
  • 回溯!在第一次重复中减少到 27 个 a.
  • The engine uses as many [a] as possible, and consumes all thirty as. Then it can not go any further, so it leaves the first repetition and captures 30 as. Now the next repetition is on and it tries to consume some more with another ([a]+) but that doesn't work of course. And then the c fails to match b.
  • Backtrack! Throw away the last a consumed by the inner repetition. After this we leave the inner repetition again, so the engine will capture 29 as. Then the other + kicks in, the inner repetition is tried out again (consuming the 30th a). Then we leave the inner repetition once again, which also leaves the capturing group, so the first capture is thrown away and the engine captures the last a. c fails to match b.
  • Backtrack! Throw away another a inside. We capture 28 as. The second (outer repetition) of the capturing group consumes the last 2 as which are captured. c fails to match b.
  • Backtrack! Now we can backtrack in the second other repetition and throw away the second of two as. The one that is left will be captured. Then we enter the capturing group for the third time and capture the last a. c fails to match b.
  • Backtrack! Down to 27 as in the first repetition.

这是一个简单的可视化.每行代表一个回溯步骤,每组括号表示一次内部重复的消耗.大括号代表那些为该回溯步骤捕获的括号,而在此特定回溯步骤中不会重新访问正常括号.我省略了 b/c 因为它永远不会匹配:

Here is a simple visualization. Each line represents one backtracking step, and each set of parentheses shows one consumption of the inner repetition. The curly brackets represent those that are newly captured for that step of backtracking, while normal parentheses are not revisited in this particular backtracking step. And I leave out the b/c because it will never be matched:

{aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaaa}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaa}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaa}{aaaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aaa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa){a}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a)(a){a}{a}

还有.所以.

请注意,最后引擎还将尝试 a 子集的所有组合(仅通过前 29 个 a 进行回溯,然后通过前 28 个 as) 只是为了发现,c 也不匹配 a.

Note that in the end the engine will also try all combinations for subsets of a (backtracking just through the first 29 as then through the first 28 as) just to discover, that c does also not match a.

正则表达式引擎内部的解释基于散布在 regular-expressions.info 周围的信息.

The explanation of regex engine internals is based on information scattered around regular-expressions.info.

为了解决这个问题.只需删除 + 之一.r'a+c' 或者如果您确实想要捕获 a 的数量,请使用 r'(a+)s'.

To solve this. Simply remove one of the +s. Either r'a+c' or if you do want to capture the amount of as use r'(a+)s'.

最后回答你的问题.我不会认为这是 Python 正则表达式引擎中的错误,而只是(如果有的话)缺乏优化逻辑.这个问题通常无法解决,因此引擎假设您必须自己处理灾难性的回溯并不太不合理.如果 Perl 足够聪明,能够识别足够简单的情况,那就更好了.

Finally, to answer your question. I would not consider this a bug in Python's regex engine, but only (if anything) a lack in optimization logic. This problem is not generally solvable, so it is not too unreasonably for an engine to assume, that you have to take care of catastrophic backtracking yourself. If Perl is clever enough to recognize sufficiently simple cases of it, so much the better.

这篇关于比较不匹配正则表达式的速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆