使用RegEx匹配大型输入时出现StackOverflowError [英] StackOverflowError when matching large input using RegEx
问题描述
使用RegEx模式匹配结果时得到StackOverflowError
.
I got StackOverflowError
when matching the result using a RegEx pattern.
模式为(\d\*?(;(?=\d))?)+
.此正则表达式用于验证输入:
The pattern is (\d\*?(;(?=\d))?)+
. This regex is used to validate the input:
12345;4342;234*;123*;344324
输入是一个字符串,由由;
分隔的值(仅数字)组成.每个值最后都可以包含一个*
(用作其他匹配的通配符).字符串末尾没有;
.
The input is a string consists of values (only digits) separated by ;
. Each value could include one *
at the end (used as wildcard for other matching). There is no ;
at the end of the string.
问题是此正则表达式可以在少量值的情况下正常工作.但是,如果值的数量太大(超过300个),则会导致StackOverflowError
.
The problem is that this regex works fine which small number of values. But when the numbers of values is too large (over 300), it will cause StackOverflowError
.
final String TEST_REGEX = "(\\d\\*?(;(?=\\d))?)+";
// Generate string
StringBuilder builder = new StringBuilder();
int number = 123456;
for (int count = 1; count <= 300; count++) {
builder.append(Integer.toString(number).concat(";"));
number++;
}
builder.deleteCharAt(builder.lastIndexOf(";"))
builder.toString().matches(TEST_REGEX); //<---------- StackOverflowError
还有堆栈跟踪:
java.lang.StackOverflowError
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
...
我认为模式中的前瞻会导致此错误,因为有很多查找,但是我还没有弄清楚如何减少它或解决该问题.
I think the lookahead in the pattern cause this error since there are lots of lookup but I haven't figured out how to reduce it or work around.
我真的很感谢任何建议,因为我没有RegEx经验.
I really appreciate any suggestion since I'm not experienced in RegEx.
推荐答案
在解决StackOverflowError
...
-
我想指出您当前的正则表达式
(\d\*?(;(?=\d))?)+
无法验证此条件.
每个值的末尾都可以包含一个*(用作其他匹配项的通配符)
Each value could include one * at the end (used as wildcard for other matching)
它无法拒绝案件23*4*4*;34*434*34
,如此处 1 .
It fails to reject the case 23*4*4*;34*434*34
, as seen here1.
您的正则表达式将对不匹配的输入进行不必要的回溯.
Your regex will do unnecessary backtracking on an non-matching input.
Java对组(\d\*?(;(?=\d))?)
的每次重复使用一个堆栈帧(重复1次或更多次+
).
Java uses one stack frame for each repetition of the group (\d\*?(;(?=\d))?)
(which is repeated 1 or more time +
).
正确的正则表达式为:
\d+\*?(?:;\d+\*?)*
请注意,这将拒绝*
,根据您的要求,您是否愿意接受还是拒绝它并不清楚.
Note that this will reject *
, which is not too clear from your requirement whether you want to accept or reject this.
这不能解决StackOverflow问题,因为组(?:;\d+\*?)
的每次重复也将耗尽堆栈.要解决此问题,请使所有的量词都为 posssive ,因为语法不明确,因此无需回溯:
This doesn't fix the StackOverflow problem, since each repetition of the group (?:;\d+\*?)
is also going to use up stack. To fix that, make all quantifiers possessive, since there is no need for backtracking, as the grammar is not ambiguous:
\d++\*?+(?:;\d++\*?+)*+
放入字符串文字:
"\\d++\\*?+(?:;\\d++\\*?+)*+"
我已经使用匹配和不匹配的输入测试了上述正则表达式,该输入具有3600多个令牌(由;
分隔).
I have tested the regex above with matching and non matching input, which has more than 3600 tokens (separated by ;
).
脚注
1 :regex101使用PCRE风格,与Java regex风格稍有不同.但是,您的正则表达式中使用的功能在它们之间是通用的,因此应该没有差异.
1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.
附录
-
实际上,根据我对您的正则表达式
(\d\*?(;(?=\d))?)+
(根据您的要求,不正确)的测试,使最外面的+
所有格++
似乎可以修复StackOverflowError
问题,至少在我的测试中,使用了大约3600个令牌(用;
分隔,字符串长约20k个字符).当针对不匹配的字符串进行测试时,它似乎也不会导致较长的执行时间.
Actually, from my testing with your regex
(\d\*?(;(?=\d))?)+
(which is incorrect according to your requirement), making the outer most+
possessive++
seem to fix theStackOverflowError
problem, at least in my testing with around 3600 tokens (separated by;
, the string is around 20k character long). It also doesn't seem to cause long execution time when testing against a non-matching string.
在我的解决方案中,使(?:;\d+\*?)
组所有格的*
量词足以解决StackOverflowError
.
In my solution, make the *
quantifier for the group (?:;\d+\*?)
possessive is enough to resolve StackOverflowError
.
"\\d+\\*?(?:;\\d+\\*?)*+"
但是,我让所有一切都保持安全.
However, I make everything possessive to be on the safe side.
这篇关于使用RegEx匹配大型输入时出现StackOverflowError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!