为什么正则表达式可以有指数级的运行时间? [英] Why can regular expressions have an exponential running time?

查看:54
本文介绍了为什么正则表达式可以有指数级的运行时间?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

可以编写在某些情况下需要指数运行时间的正则表达式.这样的例子是(aa|aa)*.如果有奇数个a的输入,它需要指数运行时间.

It is possible to write a Regex which needs in some cases exponential running time. Such an example is (aa|aa)*. If there is an input of an odd number of as it needs exponential running time.

测试这个很容易.如果输入仅包含 as 并且长度为 51,则 Regex 需要几秒钟来计算(在我的机器上).相反,如果输入长度为 52,则其计算时间并不明显(我使用 JavaRE 的内置正则表达式解析器对此进行了测试).

It is easy to test this. If the input contains only as and has length 51, the Regex needs some seconds to compute (on my machine). Instead if the input length is 52 its computing time is not noticeable (I tested this with the built-in Regex-parser of the JavaRE).

我写了一个正则表达式解析器来找出这种行为的原因,但我没有找到.我的解析器可以构建一个 AST 或一个 NFA 基于正则表达式.之后,它可以将 NFA 转换为 DFA.为此,它使用 powerset 构造算法.

I have written a Regex-parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or a NFA based on a Regex. After that it can translate the NFA to a DFA. To do this it uses the powerset construction algorithm.

当我解析上面提到的 Rgex 时,解析器创建了一个具有 7 个状态的 NFA - 转换后 DFA 中只剩下 3 个状态.DFA 表示更合理的 Regex (aa)*,可以非常快速地解析.

When I parse the Rgex mentioned above, the parser creates a NFA with 7 states - after conversion there are only 3 states left in the DFA. The DFA represents the more sensible Regex (aa)*, which can be parsed very fast.

因此,我不明白为什么解析器会这么慢.这是什么原因?他们不把 NFA 翻译成 DFA 吗?如果是,为什么不呢?他们计算如此缓慢的技术原因是什么?

Thus, I don't understand why there are parsers which can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If yes, why not? And what's the technical reasons why they compute so slow?

推荐答案

Russ Cox 有一个非常关于为什么会这样以及正则表达式的历史的详细文章(第 2 部分, 第 3 部分).

正则表达式匹配可以简单而快速,使用已经为人所知的基于有限自动机的技术.相比之下,Perl、PCRE、Python、Ruby、Java 和许多其他语言都有基于递归回溯的正则表达式实现,这些实现很简单,但速度可能非常慢.除了反向引用之外,慢速回溯实现提供的功能可以由基于自动机的实现以更快、更一致的速度提供.

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

在很大程度上,这归结为正则"表达式中的非正则特性的激增,例如反向引用,以及大多数程序员(继续)无知,不包含这些特性的正则表达式有更好的替代品(这是很多其中).

Largely, it comes down to proliferation of non-regular features in "regular" expressions such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).

在 1980 年代初期编写文本编辑器 sam 时,Rob Pike 编写了一个新的正则表达式实现,Dave Presotto 将其提取到一个库中,该库出现在第八版中.Pike 的实现将子匹配跟踪合并到了高效的 NFA 模拟中,但与第八版源的其余部分一样,并未广泛分发.派克本人并没有意识到他的技术是什么新东西.Henry Spencer 从头开始​​重新实现第八版库接口,但使用回溯,并将他的实现发布到公共领域.它变得非常广泛使用,最终成为前面提到的慢速正则表达式实现的基础:Perl、PCRE、Python 等.(在他的辩护中,Spencer 知道这些例程可能很慢,但他不知道存在更有效的算法.他甚至在文档中警告说,许多用户发现速度非常合适,尽管将 egrep 的内部替换为这段代码将是一个错误.") Pike 的正则表达式实现,扩展到支持 Unicode,在 1992 年底与 sam 一起免费提供,但特别有效的正则表达式搜索算法没有被注意到.

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, "Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.") Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.

这篇关于为什么正则表达式可以有指数级的运行时间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆