Java regex to match start/end tags causes stack overflow

Question

The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).

This approach causes stack overflow issues with input strings that exceed a (relatively small) length, which may not even be more than 1,000 characters, depending on the regex involved.

A typical example of this is the following regex using alternation to extract a possibly multiline element (named Data) from a surrounding XML string, which has already been supplied:

<Data>(?<data>(?:.|\r|\n)+?)</Data>

The above regex is used with the Matcher.find() method to read the "data" capturing group and works as expected until the supplied input string exceeds roughly 1,200 characters, at which point it causes a stack overflow.
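
The behaviour described above can be probed with a small harness. This is only a sketch: the class name AlternationOverflowDemo and the helper overflows are made up for illustration, and the length at which the error appears depends on the JVM version and thread stack size, so no particular threshold is assumed here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class AlternationOverflowDemo {
    // The pattern from the question: alternation inside a lazy quantifier.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>(?:.|\\r|\\n)+?)</Data>");

    /** Returns true if matching a payload of the given length overflows the stack. */
    static boolean overflows(int payloadLength) {
        char[] payload = new char[payloadLength];
        Arrays.fill(payload, 'x');
        String input = "<Data>" + new String(payload) + "</Data>";
        try {
            DATA.matcher(input).find();
            return false;
        } catch (StackOverflowError e) {
            // The matcher recursed too deeply for the current thread's stack.
            return true;
        }
    }

    public static void main(String[] args) {
        // Print the outcome for a few payload sizes; results vary by environment.
        for (int n : new int[] {100, 1_000, 10_000, 100_000}) {
            System.out.println(n + " chars -> overflow: " + overflows(n));
        }
    }
}
```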

Can the above regex be rewritten to avoid the stack overflow issue?

Recommended answer

Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and sometimes when too many recursive calls are made this error occurs. See the description of the bug for more details. It seems it's triggered mostly by the use of alternations.

Your regex, which uses alternation, matches one or more of any character between the two tags.

You may use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)), which makes . match line break characters as well:

(?s)<Data>(?<data>.+?)</Data>
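
A minimal sketch of how this pattern might be used from Java follows. The class name LazyDotExample and the helper extractData are hypothetical; the flag is passed as Pattern.DOTALL instead of the embedded (?s), which should be equivalent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyDotExample {
    // DOTALL lets '.' match line terminators, so no alternation is needed.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>.+?)</Data>", Pattern.DOTALL);

    /** Returns the content of the first Data element, or null if there is none. */
    static String extractData(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        String xml = "<Root>\n<Data>line 1\r\nline 2</Data>\n</Root>";
        System.out.println(extractData(xml));
    }
}
```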

However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:

<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>

Details:

  • <Data> - literal text <Data>
  • (?<data> - start of the capturing group "data"
    • [^<]* - zero or more characters other than <
    • (?:<(?!/?Data>)[^<]*)* - zero or more sequences of:
      • <(?!/?Data>) - a < that is not followed by Data> or /Data>
      • [^<]* - zero or more characters other than <
  • ) - end of the capturing group "data"
  • </Data> - literal text </Data>
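
The unrolled pattern broken down above could be exercised as in the sketch below. The class name UnrolledLoopExample, the helper extractData, and the sample payload are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnrolledLoopExample {
    // Unrolled loop: consume runs of non-'<' characters, and step over a '<'
    // only when it does not start <Data> or </Data>.
    private static final Pattern DATA =
            Pattern.compile("<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    /** Returns the content of the first Data element, or null if there is none. */
    static String extractData(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Build a large multiline payload that also contains nested markup.
        StringBuilder payload = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            payload.append("row ").append(i).append('\n');
        }
        payload.append("<b>bold</b>\n");
        String xml = "<Root><Data>" + payload + "</Data></Root>";

        String data = extractData(xml);
        System.out.println("extracted " + (data == null ? 0 : data.length()) + " characters");
    }
}
```

Note that the outer group still iterates once per stray < in the payload, so content with a very large number of < characters may deserve separate testing.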
