为什么我简单的Ragel语法会使用所有内存和崩溃 [英] Why my simple Ragel grammar use all memory and crash

查看:181
本文介绍了为什么我简单的Ragel语法会使用所有内存和崩溃的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将Adblock Plus规则中的一组正则表达式转换为可以从C ++调用的优化函数.

I am trying to convert a set of regular expression from Adblock Plus rules into an optimized function I could call from C++.

我原本希望能够使用诸如Ragel之类的词法分析器生成器来执行此操作,但是当我尝试使用很少量的Regex集时,内存使用率会非常高> 30 GB,并且Ragel退出时不会出现错误消息,也不会产生输出文件.

I was expecting to be able to use a lexer generator such as Ragel to do this but when I try with a very small set of Regex the memory usage go very high > 30 GB and Ragel quit without error message and without producing the output file.

我包括了玩具语法,下面我想了解我是否在做一些可以优化以解决问题的愚蠢行为.

I included the toy grammar bellow, I am trying to understand if I am doing anything stupid that could be optimized to solve the issue.

#include <string.h>
namespace xab{
 %%{
machine lexer;
WILDCARD = /[A-Za-z0-9;\/\?:@=&$_\.\+!\*'~#^,%:\-]/*;
SUBDOMAIN = /([A-Za-z]([A-Za-z0-9\-]*[A-Za-z0-9])?\.)+/;
SEPERATOR = /[:\/\?=&]/;
main := 
(WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD) |
(WILDCARD '.a3s?n=' WILDCARD '&zone_id=' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD ';adtech;' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adiframe|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adserv|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/affiliates.' WILDCARD '.aspx?' WILDCARD) |
(WILDCARD '/affiliates/' WILDCARD '/show_banner.' WILDCARD) |
(WILDCARD '/banner_js.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannerframe.' WILDCARD '?' WILDCARD) |
(WILDCARD '/banners.' WILDCARD '&iframe=' WILDCARD) |
(WILDCARD '/bannerview.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannery/' WILDCARD '?banner=' WILDCARD) |
(WILDCARD '/cdn-cgi/pe/bag?r[]=' WILDCARD 'cpalead.com' WILDCARD) |
(WILDCARD '/delivery/' WILDCARD '?advplaces=' WILDCARD) |
(WILDCARD '/eas?camp=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';ord=' WILDCARD) |
(WILDCARD '/ireel/ad' WILDCARD '.jpg' WILDCARD) |
(WILDCARD '/is.php?ipua_id=' WILDCARD '&search_id=' WILDCARD);
write data; 
}%%
bool matchBlacklist(const char *data) {
const char *p = data;

const char *pe = data + strlen(data);
int cs;
//write init
%% write init; 
// write exec
%% write exec;
if (cs >= lexer_first_final)
return true;
return false;
}
}

推荐答案

AFAIK,您正在面对"DFA空间爆炸".

AFAIK, you're facing the "DFA space explosion".

DFA必须在字符串的一遍中匹配您的所有规则.为此,每个状态都需要从1)过渡到每个规则的开头,以及2)过渡到每个交织规则的中间.

DFA has to match all your rules in one pass over the string. To do that every state needs a transition 1) to the beginning of every rule and 2) to the middle of every interleaving rule.

此外,WILDCARD可能正在创建不确定行为",因为例如,在WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD规则中,WILDCARD将与&prvtof=相匹配.这以及WILDCARD中大量的选项可能会进一步破坏DFA.

Further, the WILDCARD might be creating "nondeterministic behaviour" because, for example, in the WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD rule the WILDCARD will match &prvtof=. This, and the sheer amount of options in the WILDCARD might further explode the DFA.

Ragel 6.8手册中,有有关如何在"2.5.5串联"和"4.控制不确定性"中简化DFA.

In the Ragel 6.8 manual there are guidelines on how to simplify the DFA in the sections "2.5.5 Concatenation" and "4. Controlling nondeterminism".

为避免"DFA空间爆炸",您可能需要使用WILDCARD,将其替换为any.

To avoid the "DFA space explosion" you might want to kind of "deoptimize" the Ragel machine by using scanners, thus selectively switching from a "stateless" DFA to backtracking. And you might want to reduce nondeterminism by using the strong difference operator. And you might want to simplify the WILDCARD, replacing it with any.

action matched {return true;}
main := |*
  '&prvtof=' (any* -- '&poru=') '&poru=' => matched;
  '.a3s?n=' (any* -- '&zone_id=') '&zone_id=' => matched;
  any;
*|

这篇关于为什么我简单的Ragel语法会使用所有内存和崩溃的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆