正则表达式字符类的双重否定中的错误? [英] Bug in double negation of regex character classes?

查看:69
本文介绍了正则表达式字符类的双重否定中的错误?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

TL; DR

为什么 [^ \\D2] [ ^ [^ 0-9] 2] [^ 2 [^ 0-9]] 在Java中获得不同的结果?

TL;DR
Why [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java?

用于测试的代码。你现在可以跳过它。

Code used for tests. You can skip it for now.

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));






让我们说我需要正则表达式接受字符


Lets say I need regex which will accept characters that are


  • 不是数字,

  • 除了 2

  • not digits,
  • with exception of 2.

所以这样的正则表达式应代表除 0 之外的所有字符, 1 3 4 ,..., 9 。我至少可以通过两种方式来编写它,这将是所有与 2 无关的的总和:

So such regex should represent every character except 0, 1, 3,4, ... , 9. I can write it at least in two ways which will be sum of everything which is not digit with 2:


  • [[^ 0-9] 2]

  • [\ \D2]

  • [[^0-9]2]
  • [\\D2]

这两个正则表达式都按预期工作

Both of these regexes works as expected

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true

现在让我说我想要反转接受的字符。 (所以我想接受除2之外的所有数字)
我可以创建明确包含所有数字的正则表达式接受的字符如

Now lets say I want to reverse accepted characters. (so I want to accept all digits except 2) I could create regex which explicitly contains all accepted characters like


  • [013-9]

  • [013-9]

或尝试通过将其包装在另一个 [^ ...]中来否定前面描述的两个正则表达式/ code>喜欢

or try to negate two previously described regexes by wrapping it in another [^...] like


  • [^ \\D2]

  • [^ [^ 0-9] 2]

    甚至

  • [^ 2 [^ 0-9]]

  • [^\\D2]
  • [^[^0-9]2]
    or even
  • [^2[^0-9]]

但令我惊讶的是,只有前两个版本按预期工作

but to my surprise only first two versions work as expected

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true 

所以我的问题是为什么 [^ [ ^ 0-9] 2] [^ 2 [^ 0-9]] 表现不如 [ ^ \D2] ?我可以以某种方式纠正这些正则表达式,以便我可以在其中使用 [^ 0-9] 吗?

So my question is why [^[^0-9]2] or [^2[^0-9]] doesn't behave as [^\D2]? Can I somehow correct these regexes so I would be able to use [^0-9] inside them?

推荐答案

在Oracle的 Pattern 类的实现的字符类解析代码中有一些奇怪的伏都教,它随JRE / JDK一起提供如果您是从Oracle网站下载的,或者您使用的是OpenJDK。我还没有检查其他JVM(特别是 GNU Classpath )实现如何解析问题中的正则表达式。

There are some strange voodoo going on in the character class parsing code of Oracle's implementation of Pattern class, which comes with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM (notably GNU Classpath) implementations parse the regex in the question.

从这一点来说,对 Pattern 类的任何引用及其内部工作都严格限于Oracle的实现(参考实现)。

From this point, any reference to Pattern class and its internal working is strictly restricted to Oracle's implementation (the reference implementation).

需要一些时间来阅读并理解 Pattern 类如何解析嵌套否定,如问题所示。但是,我编写了一个程序 1 来从 Pattern 对象中提取信息(使用 Reflection API )来查看编译结果。下面的输出是在Java HotSpot Client VM版本1.7.0_51上运行我的程序。

It would take some time to read and understand how Pattern class parses the nested negation as shown in the question. However, I have written a program1 to extract information from a Pattern object (with Reflection API) to look at the result of compilation. The output below is from running my program on Java HotSpot Client VM version 1.7.0_51.

1:目前,该程序是一个令人尴尬的混乱。当我完成并重构它时,我会用链接更新这篇文章。

[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

这里没什么好吃的。

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match





[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

以下2个案例编译为与相同的程序[ ^ 0-9] 反直觉

The next 2 cases above are compiled to the same program as [^0-9], which is counter-intuitive.

[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match





[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

上述2个案例中没有任何异议,如问题中所述。

Nothing strange in the 2 cases above, as stated in the question.

[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match





[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

如问题中所述,这2个案例按预期工作。但是,请注意引擎如何补充第一个字符类( \D )并将设置差异应用于由剩余部分组成的字符类。

These 2 cases work as expected, as stated in the question. However, take note of how the engine takes complement of the first character class (\D) and apply set difference to the character class consisting of the leftover.

[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match





[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match





[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

通过Keppil在评论中的测试证实,上面的输出显示以上所有3个正则表达式被编译到同一个程序!

As confirmed via testing by Keppil in the comment, the output above shows that all 3 regex above are compiled to the same program!

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

而不是 NOT(UNION(2,NOT(0-9) )),这是 0-13-9 ,我们得到 UNION(NOT(2),NOT(0-) 9)),相当于 NOT(2)

Instead of NOT(UNION(2, NOT(0-9)), which is 0-13-9, we get UNION(NOT(2), NOT(0-9)), which is equivalent to NOT(2).

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

正则表达式 [^ 2 [^ [^ 0-9] ]] 由于同样的错误编译到与 [^ 2 [^ 0-9]] 相同的程序。

The regex [^2[^[^0-9]]] compiles to the same program as [^2[^0-9]] due to the same bug.

有一个未解决的bug似乎具有相同的性质: JDK-6609854

There is an unresolved bug that seems to be of the same nature: JDK-6609854.

以下是 Pattern 类的实现细节,在进一步阅读之前应该知道:

Below are implementation details of Pattern class that one should know before reading further:


  • Pattern class将 String 编译到一个节点链中,每个节点负责一个小而明确的职责,并将工作委托给链中的下一个节点。 Node class是所有节点的基类。

  • CharProperty class是所有与字符类相关的节点的基类。

  • BitClass class是 CharProperty 类的子类,它使用 boolean [] 数组来加速Latin-1字符的匹配(代码点< = 255)。它有一个 add 方法,允许在编译期间添加字符。

  • CharProperty.complement Pattern.union Pattern.intersection 是与set操作对应的方法。他们所做的是不言自明的。

  • Pattern.setDifference 不对称集合差异

  • Pattern class compiles a String into a chain of nodes, each node is in charge of a small and well-defined responsibility, and delegates the work to the next node in the chain. Node class is the base class of all the nodes.
  • CharProperty class is the base class of all character-class related Nodes.
  • BitClass class is a subclass of CharProperty class that uses a boolean[] array to speed up matching for Latin-1 characters (code point <= 255). It has an add method, which allows characters to be added during compilation.
  • CharProperty.complement, Pattern.union, Pattern.intersection are methods corresponding to set operations. What they do is self-explanatory.
  • Pattern.setDifference is asymmetric set difference.

在查看 CharProperty clazz(boolean consume)方法的完整代码之前,这是负责解析a的方法字符类,让我们看看代码的极简化版本,以了解代码的流程:

Before looking at the full code of CharProperty clazz(boolean consume) method, which is the method responsible for parsing a character class, let us look at an extremely simplified version of the code to understand the flow of the code:

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}

代码基本上读取输入(输入 String 转换为以null结尾 int [] 代码点),直至达到] 或String的结尾(未闭合的字符类)。

The code basically reads the input (the input String converted to null-terminated int[] of code points) until it hits ] or the end of the String (unclosed character class).

代码与继续打破开关块中混合在一起。但是,只要您意识到 continue 属于外部 for 循环并且 break 属于开关块,代码很容易理解:

The code is a bit confusing with continue and break mixing together inside the switch block. However, as long as you realize that continue belongs to the outer for loop and break belongs to the switch block, the code is easy to understand:


  • 结尾的案例将永远不会在开关语句后执行代码。

  • break 结尾的个案可能会在开关语句后执行代码(如果不是返回已经)。

  • Cases ending in continue will never execute the code after the switch statement.
  • Cases ending in break may execute the code after the switch statement (if it doesn't return already).

通过上面的观察,我们可以看到每当一个字符被发现是非特殊的,应该包含在字符类中,我们将在开关语句之后执行代码,其中 node = range(bits); 是第一个语句。

With the observation above, we can see that whenever a character is found to be non-special and should be included in the character class, we will execute the code after the switch statement, in which node = range(bits); is the first statement.

如果你检查源代码,方法 CHARP roperty range(BitClass位)解析字符类中的单个字符或字符范围。该方法返回传入的相同 BitClass 对象(添加了新字符)或返回 CharProperty 类的新实例。

If you check the source code, the method CharProperty range(BitClass bits) parses "a single character or a character range in a character class". The method either returns the same BitClass object passed in (with new character added) or return a new instance of CharProperty class.

接下来,让我们看一下代码的完整版本(包含部分解析字符类交集&& 省略):

Next, let us look at the full version of the code (with the part parsing character class intersection && omitted):

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}

查看中的代码案例'[':开关语句和开关语句后面的代码:

Looking at the code in case '[': of the switch statement and the code after the switch statement:


  • 节点变量存储解析单元的结果(独立字符,字符范围,速记字符类,POSIX / Unicode字符类或嵌套字符类)

  • prev 变量存储到目前为止的编译结果,并且在我们在节点中编译单元后立即更新。

  • The node variable stores the result of parsing a unit (a standalone character, a character range, a shorthand character class, a POSIX/Unicode character class or a nested character class)
  • The prev variable stores the compilation result so far, and is always updated right after we compiles a unit in node.

由于局部变量 boolean include ,它记录了字符类是否被否定,从未传递对于任何方法调用,它只能在此方法中单独执行。并且包含的唯一位置是在开关语句之后读取和处理。

Since the local variable boolean include, which records whether the character class is negated, is never passed to any method call, it can only be acted upon in this method alone. And the only place include is read and processed is after the switch statement.

这篇关于正则表达式字符类的双重否定中的错误?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆