Java regex to match start/end tags causes stack overflow
Problem description
The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).
This approach causes stack overflow issues with input strings that exceed a relatively small length, possibly no more than 1,000 characters, depending on the regex involved.
A typical example is the following regex, which uses alternation to extract a possibly multiline element (named Data) from a surrounding XML string that has already been supplied:

<Data>(?<data>(?:.|\r|\n)+?)</Data>
The above regex is used with the Matcher.find() method to read the "data" capturing group. It works as expected until the length of the supplied input string exceeds 1,200 characters or so, at which point it causes a stack overflow.
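The usage described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name DataTagOverflowDemo, the helper names, and the sample input are hypothetical, and whether the large input actually overflows depends on the JVM version and thread stack size, so no particular outcome is claimed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original post.
class DataTagOverflowDemo {
    // The alternation-based pattern from the question.
    static final Pattern ALTERNATION =
            Pattern.compile("<Data>(?<data>(?:.|\\r|\\n)+?)</Data>");

    // Returns the "data" capturing group, or null when nothing matches.
    static String extract(String xml) {
        Matcher m = ALTERNATION.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    // Returns true if matching the input ran out of stack.
    static boolean overflows(String xml) {
        try {
            extract(xml);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    // Builds an invented multiline sample: <Root><Data>line 0\nline 1\n...</Data></Root>
    static String buildInput(int lines) {
        StringBuilder sb = new StringBuilder("<Root><Data>");
        for (int i = 0; i < lines; i++) {
            sb.append("line ").append(i).append('\n');
        }
        return sb.append("</Data></Root>").toString();
    }

    public static void main(String[] args) {
        // A short input is matched without trouble.
        System.out.println(extract(buildInput(2)));
        // A long input may exhaust the stack, depending on the JVM settings.
        System.out.println("overflow on long input: " + overflows(buildInput(200_000)));
    }
}
```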
Can the above regex be rewritten to avoid the stack overflow issue?
Recommended answer
Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. The error occurs because the Pattern class compiles a regular expression into a small program, which is then executed to find a match. This program is used recursively, and when too many recursive calls are made, the error occurs. See the description of the bug for more details. It seems to be triggered mostly by the use of alternations.

Your regex (which has alternations) matches any 1+ characters between two tags. You may use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)), which makes the . match newline symbols as well:

(?s)<Data>(?<data>.+?)</Data>

However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:
<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>
Details:
<Data> - literal text <Data>
(?<data> - start of the capturing group "data"
[^<]* - zero or more characters other than <
(?:<(?!/?Data>)[^<]*)* - zero or more sequences of:
  <(?!/?Data>) - a < that is not followed by Data> or /Data>
  [^<]* - zero or more characters other than <
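
As a minimal sketch of how the two rewritten patterns might be used from Java, the example below compiles both and reads the "data" group with Matcher.find(). The class name DataTagExtractor, the helper extract, and the sample XML are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are not from the original post.
class DataTagExtractor {
    // Lazy dot matching; the embedded (?s) flag lets . match line breaks too.
    static final Pattern LAZY_DOT =
            Pattern.compile("(?s)<Data>(?<data>.+?)</Data>");

    // Unrolled loop: runs of non-< characters, separated by any <
    // that does not start <Data> or </Data>.
    static final Pattern UNROLLED =
            Pattern.compile("<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    // Returns the "data" capturing group, or null when nothing matches.
    static String extract(Pattern pattern, String xml) {
        Matcher m = pattern.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Invented sample: a multiline Data element containing a nested tag.
        String xml = "<Root><Data>first line\n<b>bold</b>\nlast line</Data></Root>";
        System.out.println(extract(LAZY_DOT, xml));
        System.out.println(extract(UNROLLED, xml));
    }
}
```

Both patterns return the same text for this sample. Note that the unrolled pattern uses * quantifiers, so unlike the .+? version it also matches an empty Data element.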