使用RegEx匹配大型输入时出现StackOverflowError [英] StackOverflowError when matching large input using RegEx

查看:141
本文介绍了使用RegEx匹配大型输入时出现StackOverflowError的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

使用RegEx模式匹配结果时得到StackOverflowError.

I got StackOverflowError when matching the result using a RegEx pattern.

模式为(\d\*?(;(?=\d))?)+.此正则表达式用于验证输入:

The pattern is (\d\*?(;(?=\d))?)+. This regex is used to validate the input:

12345;4342;234*;123*;344324

输入是一个字符串,由由;分隔的值(仅数字)组成.每个值最后都可以包含一个*(用作其他匹配的通配符).字符串末尾没有;.

The input is a string consists of values (only digits) separated by ;. Each value could include one * at the end (used as wildcard for other matching). There is no ; at the end of the string.

问题是此正则表达式可以在少量值的情况下正常工作.但是,如果值的数量太大(超过300个),则会导致StackOverflowError.

The problem is that this regex works fine which small number of values. But when the numbers of values is too large (over 300), it will cause StackOverflowError.

final String TEST_REGEX = "(\\d\\*?(;(?=\\d))?)+";

// Generate string
StringBuilder builder = new StringBuilder();
int number = 123456;
for (int count = 1; count <= 300; count++) {
    builder.append(Integer.toString(number).concat(";"));
    number++;
}
builder.deleteCharAt(builder.lastIndexOf(";"))

builder.toString().matches(TEST_REGEX); //<---------- StackOverflowError

还有堆栈跟踪:

java.lang.StackOverflowError
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    ...

我认为模式中的前瞻会导致此错误,因为有很多查找,但是我还没有弄清楚如何减少它或解决该问题.

I think the lookahead in the pattern cause this error since there are lots of lookup but I haven't figured out how to reduce it or work around.

我真的很感谢任何建议,因为我没有RegEx经验.

I really appreciate any suggestion since I'm not experienced in RegEx.

推荐答案

在解决StackOverflowError ...

  1. 我想指出您当前的正则表达式(\d\*?(;(?=\d))?)+无法验证此条件.

每个值的末尾都可以包含一个*(用作其他匹配项的通配符)

Each value could include one * at the end (used as wildcard for other matching)

它无法拒绝案件23*4*4*;34*434*34,如此处 1 .

It fails to reject the case 23*4*4*;34*434*34, as seen here1.

您的正则表达式将对不匹配的输入进行不必要的回溯.

Your regex will do unnecessary backtracking on an non-matching input.

Java对组(\d\*?(;(?=\d))?)的每次重复使用一个堆栈帧(重复1次或更多次+).

Java uses one stack frame for each repetition of the group (\d\*?(;(?=\d))?) (which is repeated 1 or more time +).

正确的正则表达式为:

\d+\*?(?:;\d+\*?)*

请注意,这将拒绝*,根据您的要求,您是否愿意接受还是拒绝它并不清楚.

Note that this will reject *, which is not too clear from your requirement whether you want to accept or reject this.

这不能解决StackOverflow问题,因为组(?:;\d+\*?)的每次重复也将耗尽堆栈.要解决此问题,请使所有的量词都为 posssive ,因为语法不明确,因此无需回溯:

This doesn't fix the StackOverflow problem, since each repetition of the group (?:;\d+\*?) is also going to use up stack. To fix that, make all quantifiers possessive, since there is no need for backtracking, as the grammar is not ambiguous:

\d++\*?+(?:;\d++\*?+)*+

放入字符串文字:

"\\d++\\*?+(?:;\\d++\\*?+)*+"

我已经使用匹配和不匹配的输入测试了上述正则表达式,该输入具有3600多个令牌(由;分隔).

I have tested the regex above with matching and non matching input, which has more than 3600 tokens (separated by ;).

脚注

1 :regex101使用PCRE风格,与Java regex风格稍有不同.但是,您的正则表达式中使用的功能在它们之间是通用的,因此应该没有差异.

1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.

附录

  • 实际上,根据我对您的正则表达式(\d\*?(;(?=\d))?)+(根据您的要求,不正确)的测试,使最外面的+所有格++似乎可以修复StackOverflowError问题,至少在我的测试中,使用了大约3600个令牌(用;分隔,字符串长约20k个字符).当针对不匹配的字符串进行测试时,它似乎也不会导致较长的执行时间.

  • Actually, from my testing with your regex (\d\*?(;(?=\d))?)+ (which is incorrect according to your requirement), making the outer most + possessive ++ seem to fix the StackOverflowError problem, at least in my testing with around 3600 tokens (separated by ;, the string is around 20k character long). It also doesn't seem to cause long execution time when testing against a non-matching string.

在我的解决方案中,使(?:;\d+\*?)组所有格的*量词足以解决StackOverflowError.

In my solution, make the * quantifier for the group (?:;\d+\*?) possessive is enough to resolve StackOverflowError.

"\\d+\\*?(?:;\\d+\\*?)*+"

但是,我让所有一切都保持安全.

However, I make everything possessive to be on the safe side.

这篇关于使用RegEx匹配大型输入时出现StackOverflowError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆