为什么Java正则表达式引擎会在+重复上抛出StringIndexOutOfBoundsException? [英] Why does Java regex engine throw StringIndexOutOfBoundsException on a + repetition?

查看:148
本文介绍了为什么Java正则表达式引擎会在+重复上抛出StringIndexOutOfBoundsException?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我写了一个正则表达式模式来找到斐波纳契数(为什么,我只是这样做并不重要)。它按预期工作得非常好(在ideone.com上查看):

I've written a regex pattern to find Fibonacci numbers (it doesn't matter why, I just did). It works wonderfully as expected (see on ideone.com):

    String FIBONACCI = 
        "(?x) .{0,2} | (?: (?=(\\2?)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)++ . ";

    for (int n = 0; n < 1000; n++) {
        String s = new String(new char[n]);
        if (s.matches(FIBONACCI)) {
            System.out.print(n + " ");
        }
    } // 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

占有式重复(即 +主循环上的+ 是至关重要的,因为您不希望使用此匹配算法进行回溯。但是,使重复回溯(即主要循环上的 + )不会导致不匹配,而是导致运行时异常! (在ideone.com上看到):

A possessive repetition (i.e. ++ on the main "loop") is crucial, because you don't want backtracking with this matching algorithm. However, making the repetition backtrackable (i.e. just + on the main "loop") results not in mismatches, but rather a runtime exception!!! (as seen on ideone.com):

Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
    String index out of range: -1

    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3994)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:3966)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3916)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Matcher.match(Matcher.java:1127)
    at java.util.regex.Matcher.matches(Matcher.java:502)
    at java.util.regex.Pattern.matches(Pattern.java:930)
    at java.lang.String.matches(String.java:2090)

谁能解释一下h在这里?这是Java正则表达式引擎中的错误吗?

Can someone explain what happened here? Is this a bug in the Java regex engine?

推荐答案

错误ID 6984178



引发投掷的许多错误 StringIndexOutOfBoundsException 请参阅:搜索结果。特别报告此内容并在内部接受为错误ID 6984178 (可能需要一段时间才能显示在外部数据库中)。

Bug ID 6984178

There are many bugs related to the engine throwing StringIndexOutOfBoundsException (see: search results. This one in particular has been reported and internally accepted as Bug ID 6984178 (it may take a while for this to show up in the external database).

这是一个更简单的模式,可以重现错误(另见ideone.com ):

Here's a much simpler pattern that reproduces the bug (see also on ideone.com):

System.out.println(
   "abaab".matches("(?x) (?: (?=(a+)) \\1 b )* x")
); // StringIndexOutOfBounds: -1

请注意使用 *? * + 只是按预期返回 false

Note that using *? or *+ simply returns false as expected.

看起来这个问题是由于在前瞻中有一个对捕获组的引用而试图回溯贪婪的重复时触发的:越界索引是差异第一个和第二个之间的长度 a + (例如aabaaaaab获取 -3 )。

It looks like the problem is triggered by the attempt to backtrack a greedy repetition when there's a reference to a capturing group inside a lookahead: the out of bounds index is the difference in length between the first and the second a+ (e.g. "aabaaaaab" gets -3).

可能需要调试 java.util.regex.Pattern 源代码,以精确查明确切内容错误的本质。

One would likely have to debug the java.util.regex.Pattern source code to pinpoint the exact nature of the bug.

这里有一个更详细的片段,展示了引擎获得的疯狂情况在这种模式上:

Here's a more verbose snippet to show just how bonkers the engine gets on this pattern:

String FIBONACCI = 
    "(?x) .{0,2} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+ . ";

for (int n = 0; n < 1000; n++) {
    String s = new String(new char[n]);
    try {
        if (s.matches(FIBONACCI)) {
            System.out.printf("%n%s", n);
        }
    } catch (StringIndexOutOfBoundsException e) {
        String index = e.getMessage().replace("String index out of range: ", "");
        System.out.printf(" <%s:%s>", n, index);
    }
}

(稍加编辑)输出为(见于ideone.com ):

The (slightly edited) output is (as seen on ideone.com):

0 1 2 3 <5:-1>
6 <7:-1> ... <12:-1> <13:-3>
14 <15:-3> ... <33:-3> <34:-8>
35 <36:-8> ... <88:-8> <89:-21>
90 <91:-21> ... <232:-21> <233:-55>
234 <235:-55> ... <609:-55> <610:-144>
611 <612:-144> ...

所以引擎试图在-1,-3,-8处访问字符串索引,-21,-55,-144等。请注意,这些是每隔一个斐波纳契数,但是为负数。另请注意,除了前几个数字之外,其余的匹配(6,14,35,...)是 NOT Fibonacci数字。

So somehow the engine tries to access string indices at -1, -3, -8, -21, -55, -144, etc. Note that these are every other Fibonacci number, but negative. Note also that beyond the first few numbers, the rest of the matches (6, 14, 35, ...) are NOT Fibonacci numbers.

尽管该模式最初是为了占有量量词的必要性而编写的,但实际上回溯重复仍然会产生正确的答案(假设引擎不像Java那样错误)。这是.NET引擎上的C#实现(另见ideone.com ):

While the pattern was originally written with the necessity for possessive quantifier in mind, in fact a backtracking repetition will still yield the correct answer (assuming the engine isn't buggy like Java's). Here's a C# implementation on the .NET engine (see also on ideone.com):

Regex r = new Regex(
  @"(?x) ^.{0,1}$ | ^(?: (?=(\2?)) (?=(\2\3|^.)) (?=(\1)) \2)+ . $ "
);

for (int n = 0; n < 1000; n++) {
  if (r.IsMatch("".PadLeft(n))) {
    Console.Write("{0} ", n);
  }
}
// 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 

正如您所看到的,即使回溯 + 循环,输出也是正确的。事实上,正是因为它是一个回溯循环,特殊情况可以仅限于。{0,1} 而不是。{0,2 }

As you can see, the output is correct, even with a backtracking + "loop". In fact, precisely because it's a backtracking loop, the special case can be limited to just .{0,1} instead of .{0,2}.

这在Java中按预期工作。此外,因为它不情愿,我们还可以将特殊情况限制为。{0,1} 另见ideone.com ):

This works in Java as expected. Also, because it's reluctant, we can also limit the special case to just .{0,1} (see also on ideone.com):

String FIBONACCI = 
        "(?x) .{0,1} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)+? . ";






关于算法



公式



该模式利用 Fibonacci数字的第二个身份

这可以通过归纳来证明。

This can be proven by induction.

让我们使用这个版本的模式(在Java中工作,锚定时也适用于C#):

Let's use this version of the pattern (which works in Java, and when anchored, also works in C#):

                                                     summation 
free-space!                                            "loop"
    ↓                                                     ↓
   (?x) .{0,1} | (?: (?=(\2|^)) (?=(\2\3|^.)) (?=(\1)) \2)+? .
        \____/   \___________________________________/  ↑    ↑  
      base case      inductive case                    +Fi  +1
     for n = 0,1
                     (assertions don't "count" toward sum)!
                        $1 := $2    (or initialized with 0)
                        $2 := $2+$3 (or initialized with 1)
                        $3 := $1    (a temporary variable for the "swap")

基于 [$ 1,$ 2],Fibonacci序列生成非常简单:= [$ 2,$ 2 + $ 1] 过渡。由于断言是按顺序执行的,因此引入了临时变量(就像单指配伪代码版本一样)。

The Fibonacci sequence generation is straightforward, based on the [$1, $2] := [$2, $2+$1] transition. Since the assertions are performed sequentially, a temporary variable is introduced (just like the single-assignment "pseudocode" version).

这篇关于为什么Java正则表达式引擎会在+重复上抛出StringIndexOutOfBoundsException?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆