正则表达式字符类的双重否定中的错误? [英] Bug in double negation of regex character classes?
问题描述
TL; DR
为什么 [^ \\D2]
, [ ^ [^ 0-9] 2]
, [^ 2 [^ 0-9]]
在Java中获得不同的结果?
TL;DR
Why [^\\D2]
, [^[^0-9]2]
, [^2[^0-9]]
get different results in Java?
用于测试的代码。你现在可以跳过它。
Code used for tests. You can skip it for now.
String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
test.matches(regexes[0]), test.matches(regexes[1]),
test.matches(regexes[2]), test.matches(regexes[3]),
test.matches(regexes[4]), test.matches(regexes[5]));
让我们说我需要正则表达式接受字符
Lets say I need regex which will accept characters that are
- 不是数字,
- 除了
2
。
- not digits,
- with exception of
2
.
所以这样的正则表达式应代表除 0
之外的所有字符, 1
, 3
, 4
,..., 9
。我至少可以通过两种方式来编写它,这将是所有与 2 无关的的总和:
So such regex should represent every character except 0
, 1
, 3
,4
, ... , 9
. I can write it at least in two ways which will be sum of everything which is not digit with 2:
-
[[^ 0-9] 2]
-
[\ \D2]
[[^0-9]2]
[\\D2]
这两个正则表达式都按预期工作
Both of these regexes works as expected
match , [[^0-9]2] , [\D2]
--------------------------
x , true , true
1 , false , false
2 , true , true
3 , false , false
^ , true , true
[ , true , true
] , true , true
现在让我说我想要反转接受的字符。 (所以我想接受除2之外的所有数字)
我可以创建明确包含所有数字的正则表达式接受的字符如
Now lets say I want to reverse accepted characters. (so I want to accept all digits except 2) I could create regex which explicitly contains all accepted characters like
-
[013-9]
[013-9]
或尝试通过将其包装在另一个 [^ ...]中来否定前面描述的两个正则表达式/ code>喜欢
or try to negate two previously described regexes by wrapping it in another [^...]
like
-
[^ \\D2]
-
[^ [^ 0-9] 2]
甚至 -
[^ 2 [^ 0-9]]
[^\\D2]
[^[^0-9]2]
or even[^2[^0-9]]
但令我惊讶的是,只有前两个版本按预期工作
but to my surprise only first two versions work as expected
match | [[^0-9]2] , [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]]
------+--------------------+-------------------------------------------
x | true , true | false , false , true , true
1 | false , false | true , true , false , true
2 | true , true | false , false , false , false
3 | false , false | true , true , false , true
^ | true , true | false , false , true , true
[ | true , true | false , false , true , true
] | true , true | false , false , true , true
所以我的问题是为什么 [^ [ ^ 0-9] 2]
或 [^ 2 [^ 0-9]]
表现不如 [ ^ \D2]
?我可以以某种方式纠正这些正则表达式,以便我可以在其中使用 [^ 0-9]
吗?
So my question is why [^[^0-9]2]
or [^2[^0-9]]
doesn't behave as [^\D2]
? Can I somehow correct these regexes so I would be able to use [^0-9]
inside them?
推荐答案
在Oracle的 Pattern
类的实现的字符类解析代码中有一些奇怪的伏都教,它随JRE / JDK一起提供如果您是从Oracle网站下载的,或者您使用的是OpenJDK。我还没有检查其他JVM(特别是 GNU Classpath )实现如何解析问题中的正则表达式。
There are some strange voodoo going on in the character class parsing code of Oracle's implementation of Pattern
class, which comes with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM (notably GNU Classpath) implementations parse the regex in the question.
从这一点来说,对 Pattern
类的任何引用及其内部工作都严格限于Oracle的实现(参考实现)。
From this point, any reference to Pattern
class and its internal working is strictly restricted to Oracle's implementation (the reference implementation).
需要一些时间来阅读并理解 Pattern
类如何解析嵌套否定,如问题所示。但是,我编写了一个程序 1 来从 Pattern
对象中提取信息(使用 Reflection API )来查看编译结果。下面的输出是在Java HotSpot Client VM版本1.7.0_51上运行我的程序。
It would take some time to read and understand how Pattern
class parses the nested negation as shown in the question. However, I have written a program1 to extract information from a Pattern
object (with Reflection API) to look at the result of compilation. The output below is from running my program on Java HotSpot Client VM version 1.7.0_51.
1:目前,该程序是一个令人尴尬的混乱。当我完成并重构它时,我会用链接更新这篇文章。
[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
这里没什么好吃的。
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
以下2个案例编译为与相同的程序[ ^ 0-9]
,反直觉。
The next 2 cases above are compiled to the same program as [^0-9]
, which is counter-intuitive.
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
上述2个案例中没有任何异议,如问题中所述。
Nothing strange in the 2 cases above, as stated in the question.
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0030][U+0031]
01
Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
如问题中所述,这2个案例按预期工作。但是,请注意引擎如何补充第一个字符类( \D
)并将设置差异应用于由剩余部分组成的字符类。
These 2 cases work as expected, as stated in the question. However, take note of how the engine takes complement of the first character class (\D
) and apply set difference to the character class consisting of the leftover.
[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
通过Keppil在评论中的测试证实,上面的输出显示以上所有3个正则表达式被编译到同一个程序!
As confirmed via testing by Keppil in the comment, the output above shows that all 3 regex above are compiled to the same program!
[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
而不是 NOT(UNION(2,NOT(0-9) ))
,这是 0-13-9
,我们得到 UNION(NOT(2),NOT(0-) 9))
,相当于 NOT(2)
。
Instead of NOT(UNION(2, NOT(0-9))
, which is 0-13-9
, we get UNION(NOT(2), NOT(0-9))
, which is equivalent to NOT(2)
.
[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
正则表达式 [^ 2 [^ [^ 0-9] ]]
由于同样的错误编译到与 [^ 2 [^ 0-9]]
相同的程序。
The regex [^2[^[^0-9]]]
compiles to the same program as [^2[^0-9]]
due to the same bug.
有一个未解决的bug似乎具有相同的性质: JDK-6609854 。
There is an unresolved bug that seems to be of the same nature: JDK-6609854.
以下是 Pattern
类的实现细节,在进一步阅读之前应该知道:
Below are implementation details of Pattern
class that one should know before reading further:
-
Pattern
class将String
编译到一个节点链中,每个节点负责一个小而明确的职责,并将工作委托给链中的下一个节点。Node
class是所有节点的基类。 -
CharProperty
class是所有与字符类相关的节点
的基类。 -
BitClass
class是CharProperty
类的子类,它使用boolean []
数组来加速Latin-1字符的匹配(代码点< = 255)。它有一个add
方法,允许在编译期间添加字符。 -
CharProperty.complement
,Pattern.union
,Pattern.intersection
是与set操作对应的方法。他们所做的是不言自明的。 -
Pattern.setDifference
是不对称集合差异。
Pattern
class compiles aString
into a chain of nodes, each node is in charge of a small and well-defined responsibility, and delegates the work to the next node in the chain.Node
class is the base class of all the nodes.CharProperty
class is the base class of all character-class relatedNode
s.BitClass
class is a subclass ofCharProperty
class that uses aboolean[]
array to speed up matching for Latin-1 characters (code point <= 255). It has anadd
method, which allows characters to be added during compilation.CharProperty.complement
,Pattern.union
,Pattern.intersection
are methods corresponding to set operations. What they do is self-explanatory.Pattern.setDifference
is asymmetric set difference.
在查看 CharProperty clazz(boolean consume)
方法的完整代码之前,这是负责解析a的方法字符类,让我们看看代码的极简化版本,以了解代码的流程:
Before looking at the full code of CharProperty clazz(boolean consume)
method, which is the method responsible for parsing a character class, let us look at an extremely simplified version of the code to understand the flow of the code:
private CharProperty clazz(boolean consume) {
// [Declaration and initialization of local variables - OMITTED]
BitClass bits = new BitClass();
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
// [CODE OMITTED]
ch = next();
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
// [CODE OMITTED]
ch = peek();
continue;
case '&':
// [CODE OMITTED]
continue;
case 0:
// [CODE OMITTED]
// Unclosed character class is checked here
break;
case ']':
// [CODE OMITTED]
// The only return statement in this method
// is in this case
break;
default:
// [CODE OMITTED]
break;
}
node = range(bits);
// [CODE OMITTED]
ch = peek();
}
}
代码基本上读取输入(输入 String
转换为以null结尾 int []
代码点),直至达到]
或String的结尾(未闭合的字符类)。
The code basically reads the input (the input String
converted to null-terminated int[]
of code points) until it hits ]
or the end of the String (unclosed character class).
代码与继续
和打破
在开关
块中混合在一起。但是,只要您意识到 continue
属于外部 for
循环并且 break
属于开关
块,代码很容易理解:
The code is a bit confusing with continue
and break
mixing together inside the switch
block. However, as long as you realize that continue
belongs to the outer for
loop and break
belongs to the switch
block, the code is easy to understand:
- 以
结尾的案例
将永远不会在开关
语句后执行代码。 - 以
break
结尾的个案可能会在开关
语句后执行代码(如果不是返回
已经)。
- Cases ending in
continue
will never execute the code after theswitch
statement. - Cases ending in
break
may execute the code after theswitch
statement (if it doesn'treturn
already).
通过上面的观察,我们可以看到每当一个字符被发现是非特殊的,应该包含在字符类中,我们将在开关
语句之后执行代码,其中 node = range(bits);
是第一个语句。
With the observation above, we can see that whenever a character is found to be non-special and should be included in the character class, we will execute the code after the switch
statement, in which node = range(bits);
is the first statement.
如果你检查源代码,方法 CHARP roperty range(BitClass位)
解析字符类中的单个字符或字符范围。该方法返回传入的相同 BitClass
对象(添加了新字符)或返回 CharProperty
类的新实例。
If you check the source code, the method CharProperty range(BitClass bits)
parses "a single character or a character range in a character class". The method either returns the same BitClass
object passed in (with new character added) or return a new instance of CharProperty
class.
接下来,让我们看一下代码的完整版本(包含部分解析字符类交集&&
省略):
Next, let us look at the full version of the code (with the part parsing character class intersection &&
omitted):
private CharProperty clazz(boolean consume) {
CharProperty prev = null;
CharProperty node = null;
BitClass bits = new BitClass();
boolean include = true;
boolean firstInClass = true;
int ch = next();
for (;;) {
switch (ch) {
case '^':
// Negates if first char in a class, otherwise literal
if (firstInClass) {
if (temp[cursor-1] != '[')
break;
ch = next();
include = !include;
continue;
} else {
// ^ not first in class, treat as literal
break;
}
case '[':
firstInClass = false;
node = clazz(true);
if (prev == null)
prev = node;
else
prev = union(prev, node);
ch = peek();
continue;
case '&':
// [CODE OMITTED]
// There are interesting things (bugs) here,
// but it is not relevant to the discussion.
continue;
case 0:
firstInClass = false;
if (cursor >= patternLength)
throw error("Unclosed character class");
break;
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
} else {
if (prev == null) {
prev = node.complement();
} else {
if (prev != node)
prev = setDifference(prev, node);
}
}
ch = peek();
}
}
查看中的代码案例'[':
的开关
语句和开关
语句后面的代码:
Looking at the code in case '[':
of the switch
statement and the code after the switch
statement:
-
节点
变量存储解析单元的结果(独立字符,字符范围,速记字符类,POSIX / Unicode字符类或嵌套字符类) -
prev
变量存储到目前为止的编译结果,并且在我们在节点
中编译单元后立即更新。
- The
node
variable stores the result of parsing a unit (a standalone character, a character range, a shorthand character class, a POSIX/Unicode character class or a nested character class) - The
prev
variable stores the compilation result so far, and is always updated right after we compiles a unit innode
.
由于局部变量 boolean include
,它记录了字符类是否被否定,从未传递对于任何方法调用,它只能在此方法中单独执行。并且包含
的唯一位置是在开关
语句之后读取和处理。
Since the local variable boolean include
, which records whether the character class is negated, is never passed to any method call, it can only be acted upon in this method alone. And the only place include
is read and processed is after the switch
statement.
这篇关于正则表达式字符类的双重否定中的错误?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!