具有句点“."的弹性规则.没有编译 [英] Flex rule with a period "." is not compiling

查看:38
本文介绍了具有句点“."的弹性规则.没有编译的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在用flex编译此正则表达式时遇到了问题

"on"[ \t\r]*[.\n]{0,300}"."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

我在flex规范文件的规则"部分中有100条规则.我尝试编译它flex -Ce -Ca rule.flex我等待了10个小时仍未完成,所以我将其杀死.我开始发现问题,并将问题缩小到此规则.如果我从100条规则中删除此规则,则需要21秒将其编译为C代码.

如果用其他字符替换句点,则编译成功.例如

"on"[ \t\r]*[.\n]{0,300}"A"[ \t\r]*[.\n]{0,300}"from"    {counter++;} 

立即编译.甚至在空格字符之前/之后的一个周期也可以快速编译

"on"[ \t\r]*[.\n]{0,300}" ."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

我可以从 flex manual 表示."匹配文字.".

我的规则有什么问题?

解决方案

简单的答案是[.\n]可能没有按照您的想象做.在字符类内部,大多数元字符失去了特殊的含义,因此字符类仅包含两个字符:文字.和换行符.您应该使用(.|\n).

但这不能解决问题.

根本原因是使用固定的重复计数.如果匹配区域的末端不明确,则较大(甚至不是很大)的重复计数可能导致状态机呈指数级爆炸.

通过重复[.\n],重复的匹配具有明确的终止,除非正则表达式的其余部分可以以点或换行符开头.因此"."会触发问题,但"A"不会.如果您纠正重复以匹配任何字符,则随后的任何字符都会触发指数爆炸.因此,如果您进行了上述建议的更改,则正则表达式将继续无法编译.

将重复计数更改为无限重复(星号运算符)可以避免此问题.


为说明问题,我使用了-v选项来检查具有不同重复计数的状态数.这清楚地表明状态计数呈指数级增长,而且很明显不可能超过14次重复. (我没有显示时间消耗;足以说明flex的算法在DFA的大小上不是线性的,因此,每增加一次重复,状态数就会加倍,但时间消耗大约是四倍;在在16个状态下,flex花费了45秒,因此可以合理地假设大约需要一周的时间来进行23次重复,前提是所需要的6GB RAM无需过多交换就可以使用.(我没有尝试过该实验.)

$ cat badre.l
%%
"on"[ \t\r]*[.\n]{0,XXX}"."[ \t\r]*[.\n]{0,XXX}"from"
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
>   printf '{0,%d}:\t%24s\n' $i \
>      "$(flex -v -o /dev/null <( sed "s/XXX/$i/g" badre.l) |&
>         grep -o '.*DFA states')"
> done
{0,1}:        17/1000 DFA states
{0,2}:        25/1000 DFA states
{0,3}:        41/1000 DFA states
{0,4}:        73/1000 DFA states
{0,5}:       137/1000 DFA states
{0,6}:       265/1000 DFA states
{0,7}:       521/1000 DFA states
{0,8}:      1033/2000 DFA states
{0,9}:      2057/3000 DFA states
{0,10}:     4105/6000 DFA states
{0,11}:    8201/11000 DFA states
{0,12}:   16393/21000 DFA states
{0,13}:   32777/41000 DFA states
{0,14}:   65545/82000 DFA states

将正则表达式更改为对两个重复使用(.|\n)都会使状态数大约增加三倍,因为随着这种变化,两个重复都变得模棱两可(并且两者之间存在相互作用). /p>

I am facing a problem compiling this regular expression with flex

"on"[ \t\r]*[.\n]{0,300}"."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I had 100 hundred rules in rules section of flex specification file. I tried to compile it flex -Ce -Ca rule.flex I waited for 10 hours still it didn't complete so I killed it. I started to find the issue and narrowed down the problem to this rule. If I remove this rule from 100 rules, it takes 21 seconds to compile it to C code.

If I replace the period with some other character it compiles successfully. e.g.

"on"[ \t\r]*[.\n]{0,300}"A"[ \t\r]*[.\n]{0,300}"from"    {counter++;} 

compiles in no time. Even a period followed/preceded by a space character compiles quickly

"on"[ \t\r]*[.\n]{0,300}" ."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I can see from flex manual that "." matches literal "."

What is wrong with my rule?

解决方案

The simple answer is that [.\n] probably doesn't do what you think it does. Inside a character class, most metacharacters lose their special meaning, so that character class contains only two characters: a literal . and a newline. You should use (.|\n).

But that won't solve the problem.

The underlying cause is the use of a fixed repetition count. Large (or even not so large) repetition counts can result in exponential blow-up of the state machine, if the end of the matched region is ambiguous.

With the repetition of [.\n], the repeated match has an unambiguous termination unless the rest of the regex can start with a dot or a newline. So "." triggers the problem, but "A" doesn't. If you correct the repetition to match any character, then any following character will trigger exponential blow-up. So if you make the change suggested above, the regular expression will continue to be uncompilable.

Changing the repetition count to an indefinite repetition (the star operator) would avoid the problem.


To illustrate the problem, I used the -v option to check the number of states with different repetition counts. This clearly shows the exponential increase in state count, and it's obvious that going much further than 14 repetitions would be impossible. (I didn't show the time consumption; suffice it to say that flex's algorithms are not linear in the size of the DFA, so while each additional repetition doubles the number of states, it roughly quadruples the time consumption; at 16 states, flex took 45 seconds, so it's reasonable to assume that it would take about a week to do 23 repetitions, provided that the 6GB of RAM it would need was available without too much swapping. I didn't try the experiment.)

$ cat badre.l
%%
"on"[ \t\r]*[.\n]{0,XXX}"."[ \t\r]*[.\n]{0,XXX}"from"
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
>   printf '{0,%d}:\t%24s\n' $i \
>      "$(flex -v -o /dev/null <( sed "s/XXX/$i/g" badre.l) |&
>         grep -o '.*DFA states')"
> done
{0,1}:        17/1000 DFA states
{0,2}:        25/1000 DFA states
{0,3}:        41/1000 DFA states
{0,4}:        73/1000 DFA states
{0,5}:       137/1000 DFA states
{0,6}:       265/1000 DFA states
{0,7}:       521/1000 DFA states
{0,8}:      1033/2000 DFA states
{0,9}:      2057/3000 DFA states
{0,10}:     4105/6000 DFA states
{0,11}:    8201/11000 DFA states
{0,12}:   16393/21000 DFA states
{0,13}:   32777/41000 DFA states
{0,14}:   65545/82000 DFA states

Changing the regex to use (.|\n) for both repetitions roughly triples the number of states, because with that change both repetitions become ambiguous (and there is an interaction between the two of them).

这篇关于具有句点“."的弹性规则.没有编译的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆