这个正则表达式不应该发生灾难性的回溯 [英] Catastrophic backtracking shouldn't be happening on this regex

查看:122
本文介绍了这个正则表达式不应该发生灾难性的回溯的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有人可以解释为什么Java的正则表达式引擎会在这个正则表达式中进入灾难性的回溯模式吗?根据我的判断,每次交替都是互相排斥的。

  ^(?:[^'\ \\s〜:/ @#\\ | \\ ^ \\&安培; \\ [\\] \\(\\)\\ \\\\ {\\}] [^ \\\s〜:/ @#\\ | \\ ^ \\&安培; \\ [\\] \\(\\)\\\\\\ {\\}] * | 
\(?:[^] \\] + | \\)+ \|
'(?:[^'] + |'')+')

文字:'pãodeçacaraaucardmastercard platinum SUSTENTABILIDADE])



为一些替换添加占有匹配修复了问题,但我不知道为什么 - Java的正则表达式lib必须非常错误才能在互斥分支上回溯。

  ^(?:[^'\\\\\〜:/ @#\\ | \\ ^ \\& \\ \ [\\] \\(\\)\\\\\\ {\\}] [^ \\\s〜:/ @# \\ | \\ ^ \\&安培; \\ [\\] \\(\\)\\\\\\ {\\ \\\}] * | 
\(?:[^ \] ++ | \\)++ \|
'(?:[^' ] ++ |'')++')


解决方案

编辑:最后添加了Java版本 - 尽管本身就是笨拙,难以理解且无法维护。






¡没有更丑陋的模式!



首先你需要做的是以一种可以忍受的方式编写你的正则表达式任何可能的希望都是人类可读 - 并且因此可维护。
第二个你需要做的就是对其进行分析,看看它实际上在做什么。



这意味着至少需要在 Pattern.CO中编译它MMENTS 模式(或前缀(?x))然后添加空格以提供一些视觉肘部空间。尽可能接近,你实际上想要匹配的模式就是这个:

  ^ 
(?:[^'\ ~~:/ @ \#| ^& \ [\]()\\ {}] #alt 1:未加引号
[^ \〜:/ @ \#| ^& \ [\]()\\ {}] *
|(?:[^] + |)+ #alt 2:双引号
|'(?:[^'] + |'')+'#alt 3:单引号

如你所见,我在可能的地方引入了垂直和水平空白,以引导眼睛和心灵作为一种认知分块我也删除了所有无关的反斜杠。这些都是彻头彻尾的错误,或者是混淆器,除了混淆读者之外什么都不做。



注意在应用垂直空格时如何我已经在同一列中生成了从一行到下一行的相同部件,以便您可以立即看到哪些部件是相同的哪个部分不同。



这样做之后,我终于可以看到你在这里的一个匹配锚定到开始,然后选择三个选择。因此,我已经用描述性评论标记了这三个备选方案,因此无需猜测。



我还注意到你的第一个选择有两个微妙的不同(否定)方括号字符类。第二个是缺少第一个中的单引号排除。这是故意的吗?即使是这样,我发现这对我的口味来说太过重复了;部分或全部应该在一个变量中,这样你就不会冒更新一致性问题的风险。






分析



你需要做的两件事中的第二件也是更重要的是对此进行分析。您需要准确查看正在编译该模式的正则表达式程序,并且需要在运行数据时跟踪其执行情况。



Java的模式类目前无法做到这一点,虽然我已经与OraSun的当前代码保管人谈了很长时间,并且他都热衷于将此功能添加到Java并认为他知道如何做到这一点。他甚至给我发了一个原型,它完成了第一部分:汇编。所以我希望有一天可以使用它。



同时,让我们转而使用一种工具,其中正则表达式是编程语言的组成部分,不是作为一个尴尬的事后想法,一些东西咆哮着。虽然有几种语言符合这一标准,但在模式匹配方面没有达到Perl中所见的复杂程度。



这是一个等效的程序。

 #!/ usr / bin / env perl 
使用v5.10; #与占有匹配的第一个版本
使用utf8; #我们有utf8文字
使用严格; #require变量声明等
使用警告; #check for boneheadedness

my $ match = qr {
^(?:[^'\ ~~:/ @ \#| ^& \ [\]] ()\\ {}]
[^\ ~~:/ @ \#| ^& \ [\]()\\ {}] *
|(?:[^] + |)+
|'(?:[^'] + |'')+'

} x;

我的$ text ='pãodeaçúcaritaucard mastercard platinum SUSTENTABILIDAD]);

my $ count = 0;

while($ text = 〜/ $ match / g){
print得到它:$& \ n;
$ count ++;
}

if($ count = = 0){
打印匹配失败。\ n;
}

如果我们运行该程序,我们会得到匹配失败的预期答案。问题是为什么以及如何。



我们现在想看看两件事:我们想要查看该模式编译的正则表达式程序,然后我们想跟踪该正则表达式程序的执行情况。



这两个都是用

控制的

 使用重新调试; 

pragma,其中也可以通过 -Mre = debug 在命令行中指定。这就是我们在这里要做的,以避免黑客攻击源代码。



正则表达式编译



re 调试pragma通常会显示模式的编译及其执行。为了分离这些,我们可以使用Perl的仅编译开关, -c ,它不会尝试执行它编译的程序。这样我们所看到的就是编译模式。这些产生以下36行输出:

  $ perl -c -Mre = debug / tmp / bt 
编译REx%n ^(?:[^'%\ ~~:/ @ \#| ^& \ [\]()\\ {}]%n [^%\\ \\ ~~:/...
最终节目:
1:BOL(2)
2:BRANCH(26)
3:ANYOF [^ \ x09 \ x0a \x0c \ x0d#& - )/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](14)
14:STAR(79)
15:ANYOF [^ \ x09 \ x0a \ x0c \ x0d#&()/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl]( 0)
26:分支(失败)
27:TRIE-EXACT ['](79)
<> (29)
29:CURLYX [0] {1,32767}(49)
31:分支(44)
32:PLUS(48)
33:ANYOF [\\ \\ x00 - !# - \ xfff] [{unicode_all}](0)
44:BRANCH(FAIL)
45:EXACT<> (48)
47:TAIL(48)
48:WHILEM [1/2](0)
49:NOTHING(50)
50:EXACT<> (79)
<'>(55)
55:CURLYX [0] {1,32767}(75)
57:BRANCH(70)
58:PLUS (74)
59:ANYOF [\ x00&( - \ xffff] [{unicode_all}](0)
70:BRANCH(FAIL)
71:EXACT< ''>(74)
73:TAIL(74)
74:WHILEM [2/2](0)
75:NOTHING(76)
76:EXACT< ;'>(79)
78:TAIL(79)
79:END(0)
锚定(BOL)minlen 1
/ tmp / bt语法OK
释放REx:%n ^(?:[^'%\ ~~:/ @ \#| ^& \ [\]()\\ {}]%n [^% \〜:/...

如您所见,已编译的正则表达式程序位于它本身就是一种正则表达式汇编语言。(它看起来非常类似于向我演示的Java原型吐出的内容,所以我想象一下有一天你也会在Java中看到这种事情。)所有细节对此都不是必不可少的,但我要指出,节点2的指令是BRANCH,如果失败则继续执行分支26,另一个分支。第二个BRANCH是正则表达式程序的唯一其他部分,由一个TRIE-EXACT节点组成,因为它知道备选方案具有不同的起始字符串。我们将在目前讨论的两个分支分支中发生什么。



正则表达式执行



现在现在是时候看看它运行时会发生什么。您正在使用的文本字符串会导致相当多的回溯,这意味着在最终失败之前,您将需要大量输出。多少输出?嗯,这么多:

  $ perl -Mre = debug / tmp / bt 2>& 1 | wc -l 
9987

我认为10,000步是你所说的灾难性的回溯模式。让我们看看我们不能把它简化为更易于理解的东西。您的输入字符串长度为61个字符。为了更好地了解发生的情况,我们可以将其减少到'pão,这只是4个字符。 (好吧,在NFC中,就是这样;它是NFD中的五个代码点,但这里没有任何改变)。这导致167行输出:

  $ perl -Mre = debug / tmp / bt 2>& 1 | wc -l 
167

实际上,这里是正则表达式的行(编译加)当你的字符串长这么多字符时你得到的执行分析:

 字符行字符串
1 63 <'>
2 78 <'p>
3 109 <'pã>
4 167 <'pão>
2 290 <'pão>
6 389 <'pãd >
7 487 <'pãode>
8 546 <'pãode>
9 615 <'pãodea>
10 722 <'pãodeça>
....
61 9987 <'pãodeaçúcaritaucard mastercard platinum SUSTENTABILIDAD])>

当字符串是四个字符'pão时,让我们看一下调试输出。我这次省略了编译部分只显示执行部分:

  $ perl -Mre = debug / tmp / bt 
匹配REx%n ^(?:[^'%\ ~~:/ @ \#| ^& \ [\]()\\ {}]%n [ ^%\s~:/...反对'p%x {e3} o
UTF-8字符串...
0<><'p%x {e3} o> | 1:BOL(2)
0<><'p%x {e3} o> | 2:BRANCH(26)
0<>< 'p%x {e3} o> | 3:ANYOF [^ \ x09 \ x0a \ x0c \ x0d#& - )/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](14)
失败...
0<><'p%x {e3} o> | 26:BRANCH(78)
0< ;><'p%x {e3} o> | 27:TRIE-EXACT ['](79)
0<> < P%×{E3} O> |州:1接受:N Charid:2 CP:27 After State:3
1<'> <,P%×{E3} O> |状态:3接受:Y Charid:0 CP:0状态:0
得到1个可能的匹配
TRIE匹配单词#2,继续
只留下一个匹配,短路:#2 <'>
1<'> <,P%×{E3} O> | 55:CURLYX [0] {1,32767}(75)
1<'> <,P%×{E3} O> | 74:WHILEM [2/2](0)
whilem:匹配0中的1..32767
1<'> <,P%×{E3} O> | 57:分支(70)1< 1> <,P%×{E3} O> | 58:加号(74)
ANYOF [\ x00&( - \ xffff] [{unicode_all}]可以匹配2147483647中的3次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配1中的1 ..32767
5<'p%x {e3} o> ;<> | 57:BRANCH(70)
5<'p%x {e3} o><> | 58:PLUS(74)
ANYOF [\ x00& ;( - \ xff] [{unicode_all}]可以匹配2147483647中的0次...
失败...
5<'p%x {e3} o><> | 70:BRANCH(73)
5<'p%x {e3} o><> | 71:EXACT<''>(74)
失败...
BRANCH失败...
whilem:失败,尝试继续...
5<'p%x {e3} o><> ; | 75:没什么(76)
5<'p%x {e3} o> <> | 76:EXACT<'>(79)
失败...
失败...
4<'p%x {e3}> < O> | 74:WHILEM [2/2](0)
whilem:匹配1中的1 .32767
4<'p%x {e3}> < O> | 57:分支(70)
4<'p%x {e3}> < O> | 58:加号(74)
ANYOF [\ x00&( - \ xff] [{unicode_all}]可以匹配2147483647中的1次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配2中的1..32767
5<'p%x {e3} o> ;<> | 57:BRANCH(70)
5<'p%x {e3} o><> | 58:PLUS(74)
ANYOF [\ x00& ;( - \ xff] [{unicode_all}]可以匹配2147483647中的0次...
失败...
5<'p%x {e3} o><> | 70:BRANCH(73)
5<'p%x {e3} o><> | 71:EXACT<''>(74)
失败...
BRANCH失败了...
whilem:失败,尝试继续...
5<'p%x {e3} o> <> | 75:没有(76)
5<'p%x {e3} o> <> | 76:EXACT<'>(79)
失败...
失败...
失败...
4<'p%x {e3}> ; < O> | 70:分支(73)
4<'p%x {e3}> < O> | 71:EXACT<''>(74)
失败...
BRANCH失败...
whilem:失败,尝试继续...
4<' p%X {E3}> < O> | 75:没有(76)
4<'p%x {e3}> < O> | 76:EXACT<'>(79)
失败...
失败...
2<'p> <%×{E3} O> | 74:WHILEM [2/2](0)
whilem:匹配1中的1..32767
2<'p> <%×{E3} O> | 57:BRANCH(70)
2<'p> <%×{E3} O> | 58:加号(74)
ANYOF [\ x00&( - \ xff] [{unicode_all}]可以匹配2147483647中的2次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配2中的1..32767
5<'p%x {e3} o> ;<> | 57:BRANCH(70)
5<'p%x {e3} o><> | 58:PLUS(74)
ANYOF [\ x00& ;( - \ xff] [{unicode_all}]可以匹配2147483647中的0次...
失败...
5<'p%x {e3} o><> | 70:BRANCH(73)
5<'p%x {e3} o><> | 71:EXACT<''>(74)
失败...
BRANCH失败了...
whilem:失败,尝试继续...
5<'p%x {e3} o> <> | 75:没有(76)
5<'p%x {e3} o> <> | 76:EXACT<'>(79)
失败...
失败...
4<'p%x {e3}> < O> | 74:WHILEM [2/2](0)
whilem:匹配2中的1..32767
4<'p%x {e3}> < O> | 57:分支(70)
4<'p%x {e3}> < O> | 58:加号(74)
ANYOF [\ x00&( - \ xff] [{unicode_all}]可以匹配2147483647中的1次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配3中的3 ..32767
5<'p%x {e3} o> ;<> | 57:BRANCH(70)
5<'p%x {e3} o><> | 58:PLUS(74)
ANYOF [\ x00& ;( - \ xff] [{unicode_all}]可以匹配2147483647中的0次。
..
失败...
5<'p%x {e3} o> <> | 70:BRANCH(73)
5<'p%x {e3} o><> | 71:EXACT<''>(74)
失败。 ..
BRANCH失败了...
whilem:失败,尝试继续...
5<'p%x {e3} o> <> | 75:没有(76)
5<'p%x {e3} o> <> | 76:EXACT<'>(79)
失败...
失败...
失败...
4<'p%x {e3}> ; < O> | 70:分支(73)
4<'p%x {e3}> < O> | 71:EXACT<''>(74)
失败...
BRANCH失败...
whilem:失败,尝试继续...
4<' p%X {E3}> < O> | 75:没有(76)
4<'p%x {e3}> < O> | 76:EXACT<'>(79)
失败...
失败...
失败...
2<'p> <%×{E3} O> | 70:分支(73)
2<'p> <%×{E3} O> | 71:EXACT<''>(74)
失败...
BRANCH失败...
whilem:失败,尝试继续...
2<' p为H. <%×{E3} O> | 75:没什么(76)
2<'p> <%×{E3} O> | 76:EXACT<'>(79)
失败...
失败...
失败...
1<'> <,P%×{E3} O> | 70:分支(73)
1<'> <,P%×{E3} O> | 71:EXACT<''>(74)
失败...
BRANCH失败...
失败...
失败...
BRANCH失败...
匹配失败
匹配失败。
Freeing REx:%n ^(?:[^'%\ ~~:/ @ \#| ^& \ [\]()\\ {}]%n [^%\ s~:/...

你看到的是那里发生了什么trie快速分支到节点55,这是之后中的最后一个它匹配单引号,因为你的目标字符串以单引号开头。就是这个:

  |'(?:[^'] + |'')+'#alt 3:单引号

节点55是这个特里分支:

 <'>(55)
55:CURLYX [0] {1,32767}(75)
57:BRANCH(70)
58:PLUS(74)
59:ANYOF [\ x00&( - \ xffff] [{unicode_all}](0)
70:BRANCH(FAIL)
71:EXACT<''> (74)

这是执行跟踪,显示灾难性退避的发生地点:

  1<'>< p%x {e3} o> | 74:WHILEM [2/2](0)
whilem:匹配0中的1 .. 32767
1<'> <,P%×{E3} O> | 57:BRANCH(70)
1<'> <,P%×{E3} O> | 58:加号(74)
ANYOF [\ x00&( - \ xffff] [{unicode_all}]可以匹配2147483647中的3次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配1中的1 ..32767
5<'p%x {e3} o> ;<> | 57:BRANCH(70)
5<'p%x {e3} o><> | 58:PLUS(74)
ANYOF [\ x00& ;( - \ xff] [{unicode_all}]可以匹配2147483647中的0次...
失败...
5<'p%x {e3} o><> | 70:BRANCH(73)
5<'p%x {e3} o><> | 71:EXACT<''>(74)
失败...
BRANCH失败...
whilem:失败,尝试继续...
5<'p%x {e3} o><> ; | 75:没什么(76)
5<'p%x {e3} o> <> | 76:EXACT<'>(79)
失败...
失败...
4<'p%x {e3}> < O> | 74:WHILEM [2/2](0)
whilem:匹配1中的1 .32767
4<'p%x {e3}> < O> | 57:分支(70)
4<'p%x {e3}> < O> | 58:加号(74)
ANYOF [\ x00&( - \ xff] [{unicode_all}]可以匹配2147483647中的1次...
5<'p%x {e3} o><> | 74:WHILEM [2/2](0)
whilem:匹配2中的1..32767

节点58吞噬了字符串中所有3个剩余字符,pão。这导致单个终止完全匹配引用失败。因此它尝试你的替代品,这是一对单引号,但也失败。



在这一点上,我必须质疑你的模式。不应该

 '(?:[^'] + |'')+'

真的很简单吗?

  '[^'] *'

所以发生的事情是有很多方法可以做到这一点回溯寻找从逻辑上永远不会发生的事情。你h ave一个嵌套的量词,这会造成各种无望和无意识的繁忙工作。



如果我们将模式简化为这样:

  ^(?:[^'\ ~~:/ @ \#| ^& \ [\]()\\ { }] + 
| [^] *
|'[^'] *'

现在,无论输入字符串的大小如何,它都会提供相同数量的跟踪输出行:只有40,包括编译。见完整字符串的编译和执行:

 编译REx%n ^(?:[^'%\\\〜:/ @ \#| ^& \ [\]( )\\ {}]%n [^%\ s~:/... 
最终节目:
1:BOL(2)
2:分支(26) )
3:ANYOF [^ \ x09 \ x0a \x0c \ x0d#& - )/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl ](14)
14:STAR(61)
15:ANYOF [^ \ x09 \ x0a \ x0c \ x0d#&()/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](0)
26:BRANCH(FAIL)
27:TRIE-EXACT ['](61)
< >(29)
29:STAR(41)
30:ANYOF [\ x00 - !# - \ xff] [{unicode_all}](0)
41:EXACT< ;> (61)
<'> (46)
46:STAR(58)
47:ANYOF [\ x00&( - \ xffff] [{unicode_all}](0)
58:EXACT< '>(61)
60:TAIL(61)
61:END(0)
锚定(BOL)minlen 1
匹配REx%n ^(?:[ ^'%\s~:/ @ \#| ^& \ [\]()\\ {}]%n [^%\ s~:/...反对'p%x {e3} o de a%x {e7}%x {fa} car itaucard mast
ercard platinum S...
UTF-8 string ...
0<><'p%x {e3} o> | 1:BOL(2)
0<><'p%x {e3} o> | 2:BRANCH( 26)
0<><'p%x {e3} o> | 3:ANYOF [^ \ x09 \ x0a \ x0c \ x0d#& - )/:@ [-\ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](14)
失败...
0<><'p%x {e3} o > | 26:BRANCH(60)
0<><'p%x {e3} o> | 27:TRIE-EXACT ['](61)
0< ><'p%x {e3} o> |状态:1接受:N Charid:2 CP:27 After State:3
1<'>&l t; p%x {e3} o d> |状态:3接受:Y Charid:0 CP:0状态后:0
得到1个可能的匹配
TRIE匹配单词#2,继续
只留下一个匹配,短路:#2<'>
1<'> < p%x {e3} o d> | 46:STAR(58)
ANYOF [\ x00&( - \ xffff] [{unicode_all}]可以匹配2147483647中的60次...
失败...
BRANCH失败...
匹配失败
匹配失败。
释放REx:%n ^(?:[^'%\ s~:/ @ \#| ^ & \ [\]()\\ {}]%n [^%\ s~:/...

我知道你认为占有性匹配可能就是这里的答案,但我认为真正的问题是原始模式中的错误逻辑。看看现在这种运作有多么明智? / p>

如果我们在旧模式上使用你的所有物运行它,即使我认为没有意义,我们仍然可以获得恒定的运行时间,但它需要更多的步骤。这种模式

  ^(?:[^'\ ~~:/ @ \#| ^& \ [\]()\\ {}] + #alt 1:不加引号
|(?:[^] ++ |)++#alt 2:双引号
|'(?:[^'] ++ |'')++'#alt 3:单引号

编译加执行配置文件如下:

 编译REx%n ^(?: [^'%\ ~~:/ @ \#| ^& \ [\]()\\ {}]%n [^%\ s~:/... 
最终节目:
1:BOL(2)
2:BRANCH(26)
3:ANYOF [^ \ x09 \ x0a \ x0c \ x0d# & - )/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](14)
14:STAR(95)
15:ANYOF [^] \\ x09 \x0a \x0c \ x0d#&()/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](0)
26:BRANCH (FAIL)
27:TRIE-EXACT ['](95)
<> (29)
29:暂停(58)
31:CURLYX [0] {1,32767}(55)
33:分支(50)
34:暂停(54) )
36:加号(48)
37:ANYOF [\ x00 - !# - \ xff] [{unicode_all}](0)
48:SUCCEED(0)
49:TAIL(53)
50:BRANCH(FAIL)
51:EXACT<> (54)
53:TAIL(54)
54:WHILEM [1/2](0)
55:NOTHING(56)
56:SUCCEED(0)
57:TAIL(58)
58:EXACT<>(95)
<'>(63)
63:SUSPEND(92)
65 :CURLYX [0] {1,32767}(89)
67:分支(84)
68:暂停(88)
70:加号(82)
71:ANYOF [\ x00&( - \ xffff] [{unicode_all}](0)
82:成功(0)
83:TAIL(87)
84:分支(失败) )
85:EXACT<''>(88)
87:TAIL(88)
88:WHILEM [2/2](0)
89:没什么( 90)
90:成功(0)
91:TAIL(92)
92:EXACT<'>(95)
94:TAIL(95)
95:END(0)
锚定(BOL)minlen 1
匹配REx%n ^(?:[^'%\ ~~:/ @ \#| ^& \\ \\ [\]()\\ {}]%n [^%\ s~:/...反对'p %x {e3} o de a%x {e7}%x {fa} car itaucard mastercard platinum S...
UTF-8 string ...
0<> <'p%x {e3} o> | 1:BOL(2)
0<> <'p%x {e3} o> | 2:BRANCH(26)
0<> <'p%x {e3} o> | 3:ANYOF [^ \ x09 \ x0a \x0c \ x0d#& - )/:@ [ - \ ^ { - 〜] [^ {unicode} + utf8 :: IsSpacePerl](14)
失败...
0<><'p%x {e3} o> | 26:BRANCH(94)
0<><'p%x {e3} o> | 27:TRIE-EXACT ['](95)
0<> <'p%x {e3} o> |州:1接受:N Charid:2 CP:27 After State:3
1<'> < p%x {e3} o d> |状态:3接受:Y Charid:0 CP:0状态后:0
得到1个可能的匹配
TRIE匹配单词#2,继续
只留下一个匹配,短路:#2 <'>
1<'> < p%x {e3} o d> | 63:SUSPEND(92)
1<'> < p%x {e3} o d> | 65:CURLYX [0] {1,32767}(89)
1<'> < p%x {e3} o d> | 88: WHILEM[2/2](0)
whilem: matched 0 out of 1..32767
1 <’> <p%x{e3}o d>| 67: BRANCH(84)
1 <’> <p%x{e3}o d>| 68: SUSPEND(88)
1 <’> <p%x{e3}o d>| 70: PLUS(82)
ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
64 <NTABILIDAD])> <| 82: SUCCEED(0)
subpattern success...
64 <NTABILIDAD])> <| 88: WHILEM[2/2](0)
whilem: matched 1 out of 1..32767
64 <NTABILIDAD])> <| 67: BRANCH(84)
64 <NTABILIDAD])> <| 68: SUSPEND(88)
64 <NTABILIDAD])> <| 70: PLUS(82)
ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
failed...
failed...
64 <NTABILIDAD])> <| 84: BRANCH(87)
64 <NTABILIDAD])> <| 85: EXACT <’’>(88)
failed...
BRANCH failed...
whilem: failed, trying continuation...
64 <NTABILIDAD])> <| 89: NOTHING(90)
64 <NTABILIDAD])> <| 90: SUCCEED(0)
subpattern success...
64 <NTABILIDAD])> <| 92: EXACT <’>(95)
failed...
BRANCH failed...
Match failed
Match failed.
Freeing REx: \"%n ^ (?: [^’%\"\s~:/@\#|^&\[\]()\\{}]%n [^%\"\s~:/\"...

I still like my solution better. It’s shorter.






EDIT



It appears that the Java version really is taking 100 times more steps than the Perl version of an identical pattern, and I have no idea why — other than that the Perl regex compiler is about 100 times smarter at optimizations than the Java regex compiler, which doesn’t ever do any to speak of, and should.



Here’s the equivalent Java program. I’ve removed the leading anchor so that we can loop properly.

$ cat java.crap 
import java.util.regex.*;

public class crap {

public static void
main(String[ ] argv) {
String input = \"’pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])\";
String regex = \"\n\"
+ \"(?: [^’\\"\\s~:/@\\#|^&\\[\\]()\{}] # alt 1: unquoted \n\"
+ \" [^\\"\\s~:/@\\#|^&\\[\\]()\{}] * \n\"
+ \" | \\" (?: [^\\"]++ | \\"\\" )++ \\" # alt 2: double quoted \n\"
+ \" | ’ (?: [^’]++ | ’’ )++ ’ # alt 3: single quoted \n\"
+ \") \n\"
;
System.out.printf(\"Matching ‹%s› =~ qr{%s}x\n\n\", input, regex);

Pattern regcomp = Pattern.compile(regex, Pattern.COMMENTS);
Matcher regexec = regcomp.matcher(input);

int count;
for (count = 0; regexec.find(); count++) { $b$ b System.out.printf(\"Found match: ‹%s›\n\", regexec.group());
}
if (count == 0) {
System.out.printf(\"Match failed.\n\");
}
}
}

When run, that produces this:

$ javac -encoding UTF-8 crap.java && java -Dfile.encoding=UTF-8 crap 
Matching ‹’pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])› =~ qr{
(?: [^’\"\s~:/@\#|^&\[\]()\\{}] # alt 1: unquoted
[^\"\s~:/@\#|^&\[\]()\\{}] *
| \" (?: [^\"]++ | \"\" )++ \" # alt 2: double quoted
| ’ (?: [^’]++ | ’’ )++ ’ # alt 3: single quoted
)
}x

Found match: ‹pão›
Found match: ‹de›
Found match: ‹açúcar›
Found match: ‹itaucard›
Found match: ‹mastercard›
Found match: ‹platinum›
Found match: ‹SUSTENTABILIDAD›

As you can plainly see, pattern matching in Java has quite a lot to be said about it, absolutely none of which would pass the potty-mouth police. It’s simply a royal pain in the ass.


Can someone explain why Java's regex engine goes into catastrophic backtracking mode on this regex? Every alternation is mutually exclusive with every other alternation, from what I can tell.

^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
\"(?:[^\"]+|\"\")+\"|
'(?:[^']+|'')+')

Text: 'pão de açúcar itaucard mastercard platinum SUSTENTABILIDADE])

Adding possessive matching to some of the alternations fixes the problem, but I have no idea why - Java's regex lib must be extremely buggy to backtrack on mutually exclusive branches.

 ^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
 \"(?:[^\"]++|\"\")++\"|
 '(?:[^']++|'')++')

解决方案

EDIT: Added Java version at end — despite being inherently clumsy, unreadable, and unmaintainable.


¡No More Ugly Patterns!

The first thing you need to do is to write your regex in a way that can stand any possible hope of being readable — and consequently maintainable — by human beings. The second thing you need to do is profile this to see what it is actually doing.

That means at a minimum you need to compile it in Pattern.COMMENTS mode (or prefix "(?x)") and then add whitespace to give some visual elbow room. As near as I can make it out, the pattern that you are actually trying to match against is this one:

^ 
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted
    [^"\s~:/@\#|^&\[\]()\\{}] *
  | " (?: [^"]+ | "" )+ "        # alt 2: double quoted
  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted
)

As you see, I’ve introduced both vertical and horizontal whitespace in the places where I could, in order to guide the eye and mind as a sort of cognitive chunking. I’ve also removed all your extraneous backslashes. Those are all either outright errors or else obfuscators that do nothing but confuse the reader.

Notice how when applying vertical whitespace, I’ve made the parts that are the same from one line to the next occur at the same column so that you can immediately see which parts are the same and which parts are different.

Having done that, I can finally see that what you here is a match anchored to the start which is then followed by a choice of three alternatives. I’ve therefore labelled those three alternatives with a descriptive comment so one doesn't have to guess.

I also notice that your first alternative has two subtly different (negated) square‐bracketed character classes. The second one is lacking the single quote exclusion seen in the first one. Is this intentional? Even if it is, I find this far too much duplication for my tastes; some or all of that should be in a variable so that you don’t risk update coherency issues.


Profiling

The second and more important of the two things that you have to do is to profile this. You need to see exactly what regex program that that pattern is being compiled into, and you need to trace its execution as it runs over your data.

Java’s Pattern class is not currently capable of doing this, although I have spoken at some length about it with the current code custodian at OraSun, and he is both keen to add this capability to Java and thinks he knows exactly how to do it. He even sent me a prototype that does the first part: the compilation. So I expect it to be available one of these days.

Meanwhile, let us turn instead to a tool in which regexes are an integral part of the programming language proper, not something bolted on as an awkward afterthought. Although several languages meet that criterion, in none has the art of pattern matching reached the level of sophistication seen in Perl.

Here then is an equivalent program.

#!/usr/bin/env perl
use v5.10;      # first release with possessive matches
use utf8;       # we have utf8 literals
use strict;     # require variable declarations, etc
use warnings;   # check for boneheadedness

my $match = qr{
    ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}]
          [^"\s~:/@\#|^&\[\]()\\{}] *
        | " (?: [^"]+ | "" )+ "
        | ' (?: [^']+ | '' )+ '
    )
}x;

my $text = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";

my $count = 0;

while ($text =~ /$match/g) {
    print "Got it: $&\n";
    $count++;
}

if ($count == 0) {
    print "Match failed.\n";
}

If we run that program, we get the expected answer that the match failed. The question is why and how.

We now want to look at two things: we want to see what regex program that pattern compiles into, and we then want to trace the execution of that regex program.

Both of these are controlled with the

use re "debug";

pragma, which can also be specified on the command line via -Mre=debug. This is what we’ll do here to avoid hacking on the source code.

Regex Compilation

The re debugging pragma will normally show both the compilation of the pattern as well as its execution. To separate those, we can use Perl’s "compile only" switch, -c, which does not attempt to execution the program it has compiled. That way all we look at is the compiled pattern. Do these produces the following 36 lines of output:

$ perl -c -Mre=debug /tmp/bt
Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (79)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (79)
        <"> (29)
  29:     CURLYX[0] {1,32767} (49)
  31:       BRANCH (44)
  32:         PLUS (48)
  33:           ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  44:       BRANCH (FAIL)
  45:         EXACT <""> (48)
  47:       TAIL (48)
  48:     WHILEM[1/2] (0)
  49:     NOTHING (50)
  50:     EXACT <"> (79)
        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)
  73:       TAIL (74)
  74:     WHILEM[2/2] (0)
  75:     NOTHING (76)
  76:     EXACT <'> (79)
  78: TAIL (79)
  79: END (0)
anchored(BOL) minlen 1 
/tmp/bt syntax OK
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

As you see, the compiled regex program is in something of a "regex assembly language" of its own. (It also happens to look very much like what the Java prototype that was demonstrated to me spits out, so I imagine you’ll one day see this sort of thing in Java, too.) All the details aren’t essential to this, but I will point out that the instruction at node 2 is a BRANCH, which if it fails goes on to execute branch 26, another BRANCH. That second BRANCH, which is the only other part of the regex program, consists of a single TRIE-EXACT node, because it knows that the alternatives have different starting literal strings. What happens w ithin those two trie branches we will discuss presently.

Regex Execution

Now it’s time to see what happens when it runs. The text string you are using causes quite a fair amount of backtracking, which means you would have a lot of output to wade through before it finally fails. How much output? Well, this much:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
9987

I assume that 10,000 steps is what you meant by "catastrophic backtracking mode". Let’s see that we can’t whittle that down into something more comprehensible. Your input string is 61 characters in length. To better see what is happening, we can trim it down to just 'pão, which is only 4 characters. (Well, in NFC, that is; it’s five code points in NFD, but that changes nothing here). That results in 167 lines of output:

$ perl -Mre=debug /tmp/bt 2>&1 | wc -l
167

In fact, here are the lines of regex (compilation plus) execution profiling that you get when your string is this many characters long:

chars   lines   string
    1     63   ‹'›
    2     78   ‹'p›  
    3    109   ‹'pã›
    4    167   ‹'pão› 
    5    290   ‹'pão ›
    6    389   ‹'pão d›
    7    487   ‹'pão de›
    8    546   ‹'pão de ›
    9    615   ‹'pão de a›
   10    722   ‹'pão de aç›
  ....
   61   9987   ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])›

Let’s look at the debugging output when the string is the four characters 'pão. I’ve omitted the compilation part this time to show only the execution part:

$ perl -Mre=debug /tmp/bt
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o"
UTF-8 string...
   0 <> <'p%x{e3}o>  |  1:BOL(2)
   0 <> <'p%x{e3}o>  |  2:BRANCH(26)
   0 <> <'p%x{e3}o>  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o>  | 26:BRANCH(78)
   0 <> <'p%x{e3}o>  | 27:  TRIE-EXACT["'](79)
   0 <> <'p%x{e3}o>  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o>  |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o>  | 55:  CURLYX[0] {1,32767}(75)
   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)   1 <'> <p%x{e3}o>          | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   4 <'p%x{e3}> <o>  | 70:            BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:            NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   2 <'p> <%x{e3}o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   2 <'p> <%x{e3}o>  | 57:            BRANCH(70)
   2 <'p> <%x{e3}o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 2 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
   5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
   4 <'p%x{e3}> <o>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:                  BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:                    PLUS(74)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                      WHILEM[2/2](0)
                                                whilem: matched 3 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:                        BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:                          PLUS(74)
                                                    ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647.
..
                                                    failed...
   5 <'p%x{e3}o> <>  | 70:                        BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:                          EXACT <''>(74)
                                                    failed...
                                                  BRANCH failed...
                                                whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:                        NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:                        EXACT <'>(79)
                                                  failed...
                                                failed...
                                              failed...
   4 <'p%x{e3}> <o>  | 70:                  BRANCH(73)
   4 <'p%x{e3}> <o>  | 71:                    EXACT <''>(74)
                                              failed...
                                            BRANCH failed...
                                          whilem: failed, trying continuation...
   4 <'p%x{e3}> <o>  | 75:                  NOTHING(76)
   4 <'p%x{e3}> <o>  | 76:                  EXACT <'>(79)
                                            failed...
                                          failed...
                                        failed...
   2 <'p> <%x{e3}o>  | 70:            BRANCH(73)
   2 <'p> <%x{e3}o>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   2 <'p> <%x{e3}o>  | 75:            NOTHING(76)
   2 <'p> <%x{e3}o>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
                                  failed...
   1 <'> <p%x{e3}o>  | 70:      BRANCH(73)
   1 <'> <p%x{e3}o>  | 71:        EXACT <''>(74)
                                  failed...
                                BRANCH failed...
                              failed...
                            failed...
                          BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

What you see happening there is that the trie quickly branches down to node 55, which is the last of your three alternatives after it has matched the single quote, because your target string starts with a single quote. That is this one:

  | ' (?: [^']+ | '' )+ '        # alt 3: single quoted

Node 55 is this trie branch:

        <'> (55)
  55:     CURLYX[0] {1,32767} (75)
  57:       BRANCH (70)
  58:         PLUS (74)
  59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  70:       BRANCH (FAIL)
  71:         EXACT <''> (74)

And here is the execution trace showing where your catastrophic backoff is happening:

   1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                              whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o>  | 57:      BRANCH(70)
   1 <'> <p%x{e3}o>  | 58:        PLUS(74)
                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
   5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                        failed...
   5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
   5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
   5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
   5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                      failed...
                                    failed...
   4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
   4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
   4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
   5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                          whilem: matched 2 out of 1..32767

Node 58 gobbled up all 3 remaining characters in the string, pão. That caused the terminating exact match of a single quote to fail. It therefore attempts your alternative, which is a pair of single quotes, and that also fails.

It is at this point I have to question your pattern. Shouldn’t

' (?: [^']+ | '' )+ '

really simply be this?

' [^']* '

So what is going on is that there are lot of ways for it to backtrack looking for something that can logically never occur. You have a nested quantifier, and that is causing all kinds of hopeless and mindless busywork.

If we swap simplify the pattern into this:

^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +
    | " [^"]* "
    | ' [^']* '
  )

It now gives the same number of trace output lines no matter the size of the input string: only 40, and that includes the compilation. Witness both compilation and execution on the full string:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (61)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (61)
        <"> (29)
  29:     STAR (41)
  30:       ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  41:     EXACT <"> (61)
        <'> (46)
  46:     STAR (58)
  47:       ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  58:     EXACT <'> (61)
  60: TAIL (61)
  61: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mast
ercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o >  |  1:BOL(2)
   0 <> <'p%x{e3}o >  |  2:BRANCH(26)
   0 <> <'p%x{e3}o >  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                             failed...
   0 <> <'p%x{e3}o >  | 26:BRANCH(60)
   0 <> <'p%x{e3}o >  | 27:  TRIE-EXACT["'](61)
   0 <> <'p%x{e3}o >  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d> |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                             got 1 possible matches
                             TRIE matched word #2, continuing
                             only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d> | 46:  STAR(58)
                             ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
                             failed...
                           BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

I know you were thinking that possessive matching might be the answer here, but I think the real problem is the faulty logic in the original pattern. See how more sanely this runs now?

If we run it with your possessives on the old pattern, even though I don’t think that makes sense, we still get constant runtime, but it takes more steps. With this pattern

   ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +    # alt 1: unquoted
       | " (?: [^"]++ | "" )++ "        # alt 2: double quoted
       | ' (?: [^']++ | '' )++ '        # alt 3: single quoted
     )

The compilation plus execution profile is as follows:

Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
Final program:
   1: BOL (2)
   2: BRANCH (26)
   3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
  14:   STAR (95)
  15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
  26: BRANCH (FAIL)
  27:   TRIE-EXACT["'] (95)
        <"> (29)
  29:     SUSPEND (58)
  31:       CURLYX[0] {1,32767} (55)
  33:         BRANCH (50)
  34:           SUSPEND (54)
  36:             PLUS (48)
  37:               ANYOF[\x00-!#-\xff][{unicode_all}] (0)
  48:             SUCCEED (0)
  49:           TAIL (53)
  50:         BRANCH (FAIL)
  51:           EXACT <""> (54)
  53:         TAIL (54)
  54:       WHILEM[1/2] (0)
  55:       NOTHING (56)
  56:       SUCCEED (0)
  57:     TAIL (58)
  58:     EXACT <"> (95)
        <'> (63)
  63:     SUSPEND (92)
  65:       CURLYX[0] {1,32767} (89)
  67:         BRANCH (84)
  68:           SUSPEND (88)
  70:             PLUS (82)
  71:               ANYOF[\x00-&(-\xff][{unicode_all}] (0)
  82:             SUCCEED (0)
  83:           TAIL (87)
  84:         BRANCH (FAIL)
  85:           EXACT <''> (88)
  87:         TAIL (88)
  88:       WHILEM[2/2] (0)
  89:       NOTHING (90)
  90:       SUCCEED (0)
  91:     TAIL (92)
  92:     EXACT <'> (95)
  94: TAIL (95)
  95: END (0)
anchored(BOL) minlen 1 
Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mastercard platinum S"...
UTF-8 string...
   0 <> <'p%x{e3}o > |  1:BOL(2)
   0 <> <'p%x{e3}o > |  2:BRANCH(26)
   0 <> <'p%x{e3}o > |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                            failed...
   0 <> <'p%x{e3}o > | 26:BRANCH(94)
   0 <> <'p%x{e3}o > | 27:  TRIE-EXACT["'](95)
   0 <> <'p%x{e3}o > |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
   1 <'> <p%x{e3}o d>|      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                            got 1 possible matches
                            TRIE matched word #2, continuing
                            only one match left, short-circuiting: #2 <'>
   1 <'> <p%x{e3}o d>| 63:  SUSPEND(92)
   1 <'> <p%x{e3}o d>| 65:    CURLYX[0] {1,32767}(89)
   1 <'> <p%x{e3}o d>| 88:      WHILEM[2/2](0)
                                whilem: matched 0 out of 1..32767
   1 <'> <p%x{e3}o d>| 67:        BRANCH(84)
   1 <'> <p%x{e3}o d>| 68:          SUSPEND(88)
   1 <'> <p%x{e3}o d>| 70:            PLUS(82)
                                      ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
  64 <NTABILIDAD])> <| 82:              SUCCEED(0)
                                        subpattern success...
  64 <NTABILIDAD])> <| 88:          WHILEM[2/2](0)
                                    whilem: matched 1 out of 1..32767
  64 <NTABILIDAD])> <| 67:            BRANCH(84)
  64 <NTABILIDAD])> <| 68:              SUSPEND(88)
  64 <NTABILIDAD])> <| 70:                PLUS(82)
                                          ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                          failed...
                                        failed...
  64 <NTABILIDAD])> <| 84:            BRANCH(87)
  64 <NTABILIDAD])> <| 85:              EXACT <''>(88)
                                        failed...
                                      BRANCH failed...
                                    whilem: failed, trying continuation...
  64 <NTABILIDAD])> <| 89:            NOTHING(90)
  64 <NTABILIDAD])> <| 90:            SUCCEED(0)
                                      subpattern success...
  64 <NTABILIDAD])> <| 92:  EXACT <'>(95)
                failed...
              BRANCH failed...
Match failed
Match failed.
Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...

I still like my solution better. It’s shorter.


EDIT

It appears that the Java version really is taking 100 times more steps than the Perl version of an identical pattern, and I have no idea why — other than that the Perl regex compiler is about 100 times smarter at optimizations than the Java regex compiler, which doesn’t ever do any to speak of, and should.

Here’s the equivalent Java program. I’ve removed the leading anchor so that we can loop properly.

$ cat java.crap
import java.util.regex.*;

public class crap {

public static void
main(String[ ] argv) {
    String input = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";
    String regex = "\n"
                + "(?: [^'\"\\s~:/@\\#|^&\\[\\]()\\\\{}]    # alt 1: unquoted         \n"
                + "    [^\"\\s~:/@\\#|^&\\[\\]()\\\\{}] *                     \n"
                + "  | \" (?: [^\"]++ | \"\" )++ \"       # alt 2: double quoted   \n"
                + "  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   \n"
                + ")                                                          \n"
                ;
    System.out.printf("Matching ‹%s› =~ qr{%s}x\n\n", input, regex);

    Pattern regcomp = Pattern.compile(regex, Pattern.COMMENTS);
    Matcher regexec = regcomp.matcher(input);

    int count;
    for (count = 0; regexec.find(); count++) {
       System.out.printf("Found match: ‹%s›\n", regexec.group());
    }
    if (count == 0) {
        System.out.printf("Match failed.\n");
    }
  }
}

When run, that produces this:

$ javac -encoding UTF-8 crap.java && java -Dfile.encoding=UTF-8 crap
Matching ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])› =~ qr{
(?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted         
    [^"\s~:/@\#|^&\[\]()\\{}] *                     
  | " (?: [^"]++ | "" )++ "       # alt 2: double quoted   
  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   
)                                                          
}x

Found match: ‹pão›
Found match: ‹de›
Found match: ‹açúcar›
Found match: ‹itaucard›
Found match: ‹mastercard›
Found match: ‹platinum›
Found match: ‹SUSTENTABILIDAD›

As you can plainly see, pattern matching in Java has quite a lot to be said about it, absolutely none of which would pass the potty-mouth police. It’s simply a royal pain in the ass.

这篇关于这个正则表达式不应该发生灾难性的回溯的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆