可以处理机器生成的正则表达式的正则表达式实现:*非回溯*,O(n)? [英] Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?

查看:91
本文介绍了可以处理机器生成的正则表达式的正则表达式实现:*非回溯*,O(n)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编辑2: 要获得有关为何仍然如此重要的实际演示,请参考 stackoverflow自己的正则表达式今天造成的中断(2016-07-20)

编辑: 自从我第一次提出这个问题以来,这个问题已经有了很大的发展。请参阅下面的两个快速+兼容但功能不完全的实现。如果您知道更多或更好的实施方式,请提及它们,此处仍然没有理想的实施方式!

This question has considerably evolved since I first asked it. See below for two fast+compatible, but not completely fully featured implementations. If you know of more or better implementations, please mention them, there still isn't an ideal implementation here yet!

有人知道吗一个正常的 non-backtracking System.Text.RegularExpressions backtracks)线性时间正则表达式实现是.NET还是本机的,并且可以从.NET合理使用?为了有用,它需要:

Does anyone know of a normal non-backtracking (System.Text.RegularExpressions backtracks) linear time regex implementation either for .NET or native and reasonably usable from .NET? To be useful, it would need to:


  • 具有最坏的情况的正则​​表达式评估时间复杂度 O(m * n),其中m是正则表达式的长度,n是输入的长度。

  • 具有正常的时间复杂度O(n),因为几乎没有正则表达式实际上会触发指数状态空间,或者,如果可能的话,只会触发输入的一分钟子集。

  • 具有合理的构建速度(即没有潜在的指数DFA)

  • 供人类使用,而不是数学家使用-例如我不想重新实现unicode字符类: .NET或PCRE样式字符类是一个加号。

  • have a worst case time-complexity of regex evaluation of O(m*n) where m is the length of the regex, and n the length of the input.
  • have a normal time-complexity of O(n), since almost no regular expressions actually trigger the exponential state-space, or, if they can, only do so on a minute subset of the input.
  • have a reasonable construction speed (i.e. no potentially exponential DFA's)
  • be intended for use by human beings, not mathematicians - e.g. I don't want to reimplement unicode character classes: .NET or PCRE style character classes are a plus.

  • 奖励积分,如果它实现了基于堆栈的功能,则让它实用它会处理嵌套,但要消耗O(n + m)内存而不是O(m)内存。

  • 两个 的奖励积分 捕获子表达式 的替换项(如果可能的子表达式匹配项呈指数形式,则本质上枚举所有 指数-但是不应该列举前几个,并且类似地用于替换)。您可以通过使用另一个功能来解决缺少任何一个功能的问题,因此拥有一个功能就足够了。

  • 很多将正则表达式作为头等值处理的奖励积分(因此您可以将并集,交集,串联,求反-尤其是求反和交集,因为通过正则表达式定义的字符串操作很难做到这些)

  • 惰性匹配即匹配无限流而不将其全部存储在内存中是一个加号。如果数据流不支持查找,则一次捕获一般就不可能捕获子表达式和/或替换。

  • 没有反向引用,它们根本上是不可靠的;即,在给定病理输入情况下,总是可以表现出指数行为。

  • bonus points for practicality if it implements stack-based features which let it handle nesting at the expense of consuming O(n+m) memory rather than O(m) memory.
  • bonus points for either capturing subexpressions or replacements (if there are an exponential number of possible subexpression matches, then enumerating all of them is inherently exponential - but enumerating the first few shouldn't be, and similarly for replacements). You can workaround missing either feature by using the other, so having either one is sufficient.
  • lotsa bonus points for treating regexes as first class values (so you can take the union, intersection, concatenation, negation - in particular negation and intersection as those are very hard to do by string manipulation of the regex definition)
  • lazy matching i.e. matching on unlimited streams without putting it all in memory is a plus. If the streams don't support seeking, capturing subexpressions and/or replacements aren't (in general) possible in a single pass.
  • Backreferences are out, they are fundamentally unreliable; i.e. can always exhibit exponential behavior given pathological input cases.

存在这种算法(这是基本的自动机理论...)-但是是否可以从.NET访问任何实际上可用的实现

Such algorithms exist (This is basic automata theory...) - but are there any practically usable implementations accessible from .NET?

我喜欢使用Regex进行快速而肮脏的文本清理,但是我反复遇到一些问题,其中perl / java / python / .NET使用的常见回溯NFA实现显示指数行为。不幸的是,一旦您开始自动生成正则表达式,这些情况就很容易触发。当您在匹配相同前缀的正则表达式之间交替时,甚至非指数性能也可能变得非常差-例如,在一个非常基本的示例中,如果您使用字典并将其转换为正则表达式,则性能会很差。

I like using Regex's for quick and dirty text clean-ups, but I've repeatedly run into issues where the common backtracking NFA implemtation used by perl/java/python/.NET shows exponential behavior. These cases are unfortunately rather easy to trigger as soon as you start automatically generating your regular expressions. Even non-exponential performance can become exceedingly poor when you alternate between regexes that match the same prefix - for instance, in a really basic example, if you take a dictionary and turn it into a regular expression, expect terrible performance.

有关60年代以来为什么存在更好的实现的简要概述,请参见正则表达式匹配可以既简单又快速

For a quick overview of why better implementations exist and have since the 60s, see Regular Expression Matching Can Be Simple And Fast.


  • 几乎是理想的 FSA工具包 。可以将regexes编译为DFA + NFA的快速C实现,也可以允许换能器(!),具有一流的regexes(封装yy!),包括用于交集和参数化的语法。 但是它在序言中 ...(为什么某些实用功能无法用主流语言提供?)

  • 快速但不切实际的:完整的解析器,例如出色的 ANTLR 通常支持可靠的快速正则表达式。但是,antlr的语法要冗长得多,当然允许的结构可能不会生成有效的解析器,因此您需要找到一些安全的子集。

  • Almost ideal: FSA toolkit. Can compile regexes to fast C implementations of DFA's+NFA's, allows transducers(!) too, has first class regexes (encapsulation yay!) including syntax for intersection and parametrization. But it's in prolog... (why is something with this kind of practical features not available in a mainstream language???)
  • Fast but impractical: a full parser, such as the excellent ANTLR generally supports reliably fast regexes. However, antlr's syntax is far more verbose, and of course permits constructs that may not generate valid parsers, so you'd need to find some safe subset.

  • RE2 -一个Google开源库,旨在实现合理的PCRE兼容性减去反向引用。根据作者的说法,我认为这是plan9的regex lib unix端口的继承人。

  • TRE -也与PCRE兼容,甚至可以进行反向引用,尽管使用那些会失去速度保证的方法。

  • RE2 - a google open source library aiming for reasonable PCRE compatibility minus backreferences. I think this is the successor to the unix port of plan9's regex lib, given the author.
  • TRE - also mostly compatible with PCRE and even does backreferences, although using those you lose speed guarantees. And it has a mega-nifty approximate matching mode!

不幸的是,这两种实现都是C ++,需要从.NET使用互操作。 / p>

Unfortunately both implementations are C++ and would require interop to use from .NET.

推荐答案

首先,您的建议是可能的,并且您当然知道您的主题。您还知道不使用反向引用实现的代价就是内存。如果您足够控制环境,那么这可能是一种合理的方法。

First, what your suggesting is possible and you certainly know your subject. You also know that the trade-off of not using back-referencing implementations is memory. If you control your environment enough this is likely a reasonable approach.

在继续之前,我唯一要评论的是,我鼓励您质疑使用RegEx的选择。您显然对特定的问题以及要解决的问题更加熟悉,因此只有您才能回答问题。我认为ANTLR不会是一个不错的选择。但是,自制规则引擎(如果范围有限)可以针对您的特定需求进行高度调整。这一切都取决于您的特定问题。

The only thing I will comment on before continuing is that I would encourage you to question the choice of using RegEx. You are clearly more familiar with your specific problem and what your trying to solve so only you can answer the question. I don't think ANTLR would be a good alternative; however, A home-brew rules engine (if limited in scope) can be highly tuned to your specific needs. It all depends on your specific problem.

对于那些阅读此书并错过重点的人来说,这里是一些背景知识:

For those reading this and 'missing the point', here is some background reading:

  • Regular Expression Matching Can Be Simple And Fast

在同一站点上,有许多实现在此页面上链接

From the same site, there are a number of implementations linked on this page.

以上文章的整个讨论的要点是,最好的答案是同时使用两者。为此,我知道唯一被广泛使用的实现是TCL语言使用的实现。据我了解,它最初是由Henry Spencer编写的,并且采用了这种混合方法。尽管我不知道有任何广泛使用的方法,但曾尝试过将其移植到c库中。沃尔特·沃尔多(Walter Waldo)和托马斯·拉克纳(Thomas Lackner)均已提及,​​并在此处链接。还提到了 boost库我不确定执行情况。您还可以查看TCL代码本身(从其站点链接到)并从那里开始工作。

The gist of the entire discussion of the above article is that the best answer is to use both. To that end, the only widely used implementation I'm aware of is the one used by the TCL language. As I understand it was originally written by Henry Spencer and it employs this hybrid approach. There have been a few attempts at porting it to a c library, though I'm not aware of any that are in wide use. Walter Waldo's and Thomas Lackner's are both mentioned and linked here. Also mentioned is the boost library though I'm not sure of the implementation. You can also look at the TCL code itself (linked from their site) and work from there.

简而言之,我会选择 TRE 计划9 ,因为它们都受到积极支持。

In short, I'd go with TRE or Plan 9 as these are both actively supported.

显然,这些都不是C#/。Net,我不知道是哪个。

Obviously none of these are C#/.Net and I'm not aware of one that is.

这篇关于可以处理机器生成的正则表达式的正则表达式实现:*非回溯*,O(n)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆