为什么我简单的Ragel语法会使用所有内存和崩溃 [英] Why my simple Ragel grammar use all memory and crash
问题描述
我正在尝试将Adblock Plus规则中的一组正则表达式转换为可以从C ++调用的优化函数.
I am trying to convert a set of regular expression from Adblock Plus rules into an optimized function I could call from C++.
我原本希望能够使用诸如Ragel之类的词法分析器生成器来执行此操作,但是当我尝试使用很少量的Regex集时,内存使用率会非常高> 30 GB,并且Ragel退出时不会出现错误消息,也不会产生输出文件.
I was expecting to be able to use a lexer generator such as Ragel to do this but when I try with a very small set of Regex the memory usage go very high > 30 GB and Ragel quit without error message and without producing the output file.
我包括了玩具语法,下面我想了解我是否在做一些可以优化以解决问题的愚蠢行为.
I included the toy grammar bellow, I am trying to understand if I am doing anything stupid that could be optimized to solve the issue.
#include <string.h>
namespace xab{
%%{
machine lexer;
WILDCARD = /[A-Za-z0-9;\/\?:@=&$_\.\+!\*'~#^,%:\-]/*;
SUBDOMAIN = /([A-Za-z]([A-Za-z0-9\-]*[A-Za-z0-9])?\.)+/;
SEPERATOR = /[:\/\?=&]/;
main :=
(WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD) |
(WILDCARD '.a3s?n=' WILDCARD '&zone_id=' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD ';adtech;' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adiframe|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adserv|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/affiliates.' WILDCARD '.aspx?' WILDCARD) |
(WILDCARD '/affiliates/' WILDCARD '/show_banner.' WILDCARD) |
(WILDCARD '/banner_js.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannerframe.' WILDCARD '?' WILDCARD) |
(WILDCARD '/banners.' WILDCARD '&iframe=' WILDCARD) |
(WILDCARD '/bannerview.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannery/' WILDCARD '?banner=' WILDCARD) |
(WILDCARD '/cdn-cgi/pe/bag?r[]=' WILDCARD 'cpalead.com' WILDCARD) |
(WILDCARD '/delivery/' WILDCARD '?advplaces=' WILDCARD) |
(WILDCARD '/eas?camp=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';ord=' WILDCARD) |
(WILDCARD '/ireel/ad' WILDCARD '.jpg' WILDCARD) |
(WILDCARD '/is.php?ipua_id=' WILDCARD '&search_id=' WILDCARD);
write data;
}%%
bool matchBlacklist(const char *data) {
const char *p = data;
const char *pe = data + strlen(data);
int cs;
//write init
%% write init;
// write exec
%% write exec;
if (cs >= lexer_first_final)
return true;
return false;
}
}
推荐答案
AFAIK,您正在面对"DFA空间爆炸".
AFAIK, you're facing the "DFA space explosion".
DFA必须在字符串的一遍中匹配您的所有规则.为此,每个状态都需要从1)过渡到每个规则的开头,以及2)过渡到每个交织规则的中间.
DFA has to match all your rules in one pass over the string. To do that every state needs a transition 1) to the beginning of every rule and 2) to the middle of every interleaving rule.
此外,WILDCARD
可能正在创建不确定行为",因为例如,在WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD
规则中,WILDCARD
将与&prvtof=
相匹配.这以及WILDCARD
中大量的选项可能会进一步破坏DFA.
Further, the WILDCARD
might be creating "nondeterministic behaviour" because, for example, in the WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD
rule the WILDCARD
will match &prvtof=
. This, and the sheer amount of options in the WILDCARD
might further explode the DFA.
在 Ragel 6.8手册中,有有关如何在"2.5.5串联"和"4.控制不确定性"中简化DFA.
In the Ragel 6.8 manual there are guidelines on how to simplify the DFA in the sections "2.5.5 Concatenation" and "4. Controlling nondeterminism".
为避免"DFA空间爆炸",您可能需要使用WILDCARD,将其替换为any
.
To avoid the "DFA space explosion" you might want to kind of "deoptimize" the Ragel machine by using scanners, thus selectively switching from a "stateless" DFA to backtracking. And you might want to reduce nondeterminism by using the strong difference operator. And you might want to simplify the WILDCARD
, replacing it with any
.
action matched {return true;}
main := |*
'&prvtof=' (any* -- '&poru=') '&poru=' => matched;
'.a3s?n=' (any* -- '&zone_id=') '&zone_id=' => matched;
any;
*|
这篇关于为什么我简单的Ragel语法会使用所有内存和崩溃的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!