回溯正则表达式实现的优化技术 [英] Optimization techniques for backtracking regex implementations

查看:103
本文介绍了回溯正则表达式实现的优化技术的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试基于探索Ruby的正则表达式算法.编译后的正则表达式将转换为虚拟机命令数组;为了回溯,当前命令和输入字符串索引以及捕获组信息都保留在堆栈中.

I'm trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for the backtracking the current command and input string indices as well as capturing group information is maintained on a stack.

正则表达式匹配:虚拟机方法中,Cox提供了更详细的信息关于如何将某些正则表达式组件编译为VM命令的内容,尽管所讨论的实现有所不同.根据这些文章,我的实现对于标准分组,字符类和重复组件非常有效.

In Regular Expression Matching: the Virtual Machine Approach Cox gives more detailed information about how to compile certain regex components into VM commands, though the discussed implementations are a bit different. Based on thoese articles my implementation works quite well for the standard grouping, character classes and repetition components.

现在,我想看看这种类型的实现有哪些扩展和优化选项. Cox在他的文章中提供了许多有关DFA/NFA方法的有用信息,但是有关回溯方法的扩展或优化技术的信息却很少.

Now I would like to see what extensions and optimization options there are for this type of implementation. Cox gives in his article a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is a bit sparse.

例如,关于他声明的反向引用

For example, about backreferences he states

回溯引用在回溯实现中很简单.

Backreferences are trivial in backtracking implementations.

并给出了DFA方法的想法.但是对我来说,尚不清楚如何通过VM方法轻松地"完成此操作.到达backreference命令时,您必须将来自相应组的先前匹配的字符串编译为另一组VM命令,并以某种方式将这些命令合并到当前VM中,或者维护第二个VM并暂时将执行切换到该VM.

and gives an idea for the DFA approach. But it's not clear to me how this can be "trivially" done with the VM approach. When the backreference command is reached, you'd have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and switch execution temporarily to that one.

他还提到了通过使用超前进行重复的一种可能的优化,但是没有详细说明如何实现.在我看来,这可以用来减少回溯堆栈上的项目数量.

He also mentions a possible optimization in repetitions by using look-aheads, but doesn't elaborate on how that would work. It seems to me this could be used to reduce the number items on the backtracking stack.

tl; dr 对于基于VM的回溯正则表达式实现,存在哪些常规优化技术?它们如何工作?请注意,我并不是在寻找特定于某种编程语言的优化,而是针对这种正则表达式实现的通用技术.

tl;dr What general optimization techniques exist for VM-based backtracking regex implementations and how do they work? Note that I'm not looking for optimizations specific to a certain programming language, but general techniques for this type of regex implemenations.

如第一个链接所述, Oniguruma库正是使用基于堆栈的回溯方法实现了正则表达式匹配器.也许有人可以解释该库所做的优化,这些优化可以推广到其他实现中.不幸的是,该库似乎未提供有关源代码的任何文档,并且该代码也缺少注释.

As mentioned in the first link, the Oniguruma library implements a regex matcher with exactly that stack-based backtracking approach. Perhaps someone can explain the optimizations done by that library which can be generalized to other implementations. Unfortunately, the library doesn't seem to provide any documentation on the source code and the code also lacks comments.

在阅读有关解析表达语法(PEG)的文章时,我偶然发现了

Edit 2: When reading about parsing expression grammars (PEGs), I stumbled upon a paper on a Lua PEG implementation which makes use of a similar VM-based approach. The paper mentions several optimization options to reduce the number of executed VM commands and an unnecessary growth of the backtracking stack.

推荐答案

我建议您观看完整的 lection a>,这很有趣,但是这里是概述:

I suggest you to watch full lection, it is very interesting, but here is outline:

  • 复杂性爆炸在回溯中.发生这种情况然后模式有 含糊不清(例如视频中的[a-x]*[a-x0-9]*z),因此引擎必须回溯并测试所有条件,直到确定模式匹配(或不匹配).

  • Complexity explosion in backtracking. This happens then pattern have ambiguity ([a-x]*[a-x0-9]*z in video, as an example) in it, so engine have to backtrack and test all conditions, until it become certain the pattern did (or didn't) match.

它可以占用O(Nᵖ),其中p是模式的歧义度量". 为了获得O(pN),我们需要避免一遍又一遍地评估等效线程.

It can take up to O(Nᵖ), where p is "measure of ambiguity" of pattern. To get O(pN), we need to avoid evaluating equivalent threads again and again.

...

解决方案: 一步将所有线程按一个字符对齐,广度优先"执行会导致线性复杂度.

Solution: At one step ajust all threads by one character, "Breadth-first" execution results in linear comlexity.

可以节省性能的技巧

希望这会有所帮助!

P.S Lector的存储库

P.S Lector's repository

这篇关于回溯正则表达式实现的优化技术的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆