Stack overflow when trying to use regex in Java

Question

I have read several articles on how to optimize regular expressions, but none of the suggestions (fewer groups, using {X,Y} instead of *) stopped my regex from causing a stack overflow error.

I am trying to run a dynamic search through a file. Let's say I am searching for 'i bet you cannot find me' in a fairly large file (2-4 MB). My regex generator would produce this regex:

i(?:.|\s)*?bet(?:.|\s)*?you(?:.|\s)*?cannot(?:.|\s)*?find(?:.|\s)*?me

The idea behind this regex is that it finds the words of the phrase in order, no matter what characters or whitespace come between them. However, when I try to use:

Pattern p = Pattern.compile(generatedRegex, Pattern.MULTILINE);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
    System.out.println(m.group());
}

I get a stack overflow error. I know that regex matching uses recursion, but this does not look like a particularly bad regex. Is there any way I can optimize it? Thanks!

Answer:

Pattern p = Pattern.compile("i(?:.*)bet(?:.*)you(?:.*)cannot(?:.*)find(?:.*?)me", Pattern.DOTALL);

is the pattern/regex that I ultimately ended up using. It seems fast, and I no longer get a stack overflow exception.
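One thing worth checking with this pattern is how far each match reaches, since the greedy .* runs to the end of the input and then backtracks to the last occurrence of the following word. The sketch below is only an illustration: the class name, the helper countMatches, and the sample text are made up for this example. It compares the pattern above with a reluctant variant on a short string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedySpanDemo {
    // Counts how many separate matches the regex finds in the text (DOTALL enabled).
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample text that contains the phrase twice.
        String text = "i bet you cannot find me. i bet you cannot find me.";
        String greedy = "i(?:.*)bet(?:.*)you(?:.*)cannot(?:.*)find(?:.*?)me";
        String reluctant = "i.*?bet.*?you.*?cannot.*?find.*?me";

        // Greedy: one match stretching from the first "i" into the second sentence.
        System.out.println(countMatches(greedy, text));    // prints 1
        // Reluctant: each sentence is matched separately.
        System.out.println(countMatches(reluctant, text)); // prints 2
    }
}
```

If one long match per file is acceptable, the greedy form is fine; if every occurrence should be reported separately, the difference matters.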

Recommended answer

I think you are getting a lot of backtracking because of your reluctant quantifiers (*?). One way to prevent backtracking is to use atomic grouping (?>X) and/or possessive quantifiers (*+).
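As a small illustration of what "no backtracking" means here, the following sketch compares a greedy quantifier with its possessive and atomic-group counterparts on a toy input. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Greedy: a* first takes all three 'a's, then gives one back for the final 'a'.
        System.out.println(matches("a*a", "aaa"));     // prints true
        // Possessive: a*+ takes all three 'a's and never gives any back.
        System.out.println(matches("a*+a", "aaa"));    // prints false
        // Atomic group: same effect as the possessive quantifier here.
        System.out.println(matches("(?>a*)a", "aaa")); // prints false
    }
}
```

Because a possessive quantifier never gives characters back, the part inside it has to be written so that it stops on its own before the next word, which is what the negative lookahead in the answer's regex is for.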

According to the comments, you would also prefer to capture only the "i" that is nearest to "bet", to reduce the length of the overall match. Since you want the 'i' closest to the rest of the words, add a negative lookahead for word one right beside the negative lookahead I added for word two. In other words, (?!bet) becomes (?!i)(?!bet) or (?!i|bet). I have edited the code below to include this requirement.

String fileContentsAsString = "ii ... bet ... you, ibetyouyou";
String regex = "i(?>(?!i|bet).)*+bet(?>(?!you).)*+you";
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(fileContentsAsString);
while (m.find()) {
    System.out.println(m.group());
}

Output:

i ... bet ... you

ibetyou
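Since the question mentions a regex generator, the same construction could be produced from a list of words. The following is only a sketch under assumptions: the class TemperedPhraseRegex and its methods build and findAll are hypothetical names, the word list is assumed to be non-empty, and Pattern.quote is added so that words containing regex metacharacters are treated literally (the original answer does not do this).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedPhraseRegex {
    // Builds word1(?>(?!word1|word2).)*+word2(?>(?!word3).)*+word3 ...
    // Assumes words is non-empty.
    static String build(List<String> words) {
        String first = Pattern.quote(words.get(0));
        StringBuilder sb = new StringBuilder(first);
        for (int i = 1; i < words.size(); i++) {
            String next = Pattern.quote(words.get(i));
            // The first gap also refuses to step over another copy of word one,
            // so the match starts at the occurrence nearest to word two.
            String stop = (i == 1) ? first + "|" + next : next;
            sb.append("(?>(?!").append(stop).append(").)*+").append(next);
        }
        return sb.toString();
    }

    // Collects every match of the regex in the text, with DOTALL enabled.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String regex = build(Arrays.asList("i", "bet", "you"));
        // Same input as the answer's example; prints [i ... bet ... you, ibetyou]
        System.out.println(findAll(regex, "ii ... bet ... you, ibetyouyou"));
    }
}
```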

Explanation (source):

"The way a reluctant quantifier works is, each time it's supposed to try to match, it first tries to let the next part of the regex match instead. So it's effectively doing a lookahead at the beginning of each iteration, which can get pretty expensive, especially when the quantified part only matches one character per iteration, like .*?"
