Unicode equivalents for \w and \b in Java regular expressions?


Question

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig.

Unfortunately, Java doesn't. In Java, \w is limited to [A-Za-z0-9_]. This makes matching words like those mentioned above difficult, among other problems.

It also appears that the \b word separator matches in places where it shouldn't.

What would be the correct equivalent of a .NET-like, Unicode-aware \w or \b in Java? Which other shortcuts need "rewriting" to make them Unicode-aware?

Recommended answer

Source code

The source code for the rewriting functions I discuss below is available here.

Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It’s available as an embeddable (?U) for inside the pattern, so you can use it with the String class’s wrappers, too. It also sports corrected definitions for various other properties. It now tracks The Unicode Standard, in both RL1.2 and RL1.2a from UTS#18: Unicode Regular Expressions. This is an exciting and dramatic improvement, and the development team is to be commended for this important effort.
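As a minimal sketch of what that flag changes, the following compares the default \w with the same pattern compiled under UNICODE_CHARACTER_CLASS and with the embedded (?U) form. The class and method names are made up for illustration; only Pattern, its flag, and String.matches come from the JDK, and JDK 7 or later is assumed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the answer's source code.
class UnicodeFlagDemo {
    /** Whole-string match against \w+ with Java's default, ASCII-only \w. */
    static boolean isWordAscii(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    /** Same pattern, compiled with the UNICODE_CHARACTER_CLASS flag. */
    static boolean isWordUnicode(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    /** Same again, using the embedded (?U) flag so it works through String.matches. */
    static boolean isWordEmbedded(String s) {
        return s.matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        // The sample words from the question, written with escapes to avoid
        // depending on the source file encoding.
        String[] words = {"hello", "\u00e9l\u00e8ve", "GO\u00c4_432", "gefr\u00e4\u00dfig"};
        for (String w : words) {
            System.out.printf("%-10s ascii=%b unicode=%b embedded=%b%n",
                    w, isWordAscii(w), isWordUnicode(w), isWordEmbedded(w));
        }
    }
}
```

If the flag is available, it is probably the simplest route; the hand-written expansions discussed in the rest of this answer matter mainly for older JDKs.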

The problem with Java regexes is that the Perl 1.0 charclass escapes — meaning \w, \b, \s, \d and their complements — are not in Java extended to work with Unicode. Alone amongst these, \b enjoys certain extended semantics, but these map neither to \w, nor to Unicode identifiers, nor to Unicode line-break properties.

Additionally, the POSIX properties in Java are accessed this way:

POSIX syntax    Java syntax

[[:Lower:]]     \p{Lower}
[[:Upper:]]     \p{Upper}
[[:ASCII:]]     \p{ASCII}
[[:Alpha:]]     \p{Alpha}
[[:Digit:]]     \p{Digit}
[[:Alnum:]]     \p{Alnum}
[[:Punct:]]     \p{Punct}
[[:Graph:]]     \p{Graph}
[[:Print:]]     \p{Print}
[[:Blank:]]     \p{Blank}
[[:Cntrl:]]     \p{Cntrl}
[[:XDigit:]]    \p{XDigit}
[[:Space:]]     \p{Space}

This is a real mess, because it means that things like Alpha, Lower, and Space do not in Java map to the Unicode Alphabetic, Lowercase, or Whitespace properties. This is exceedingly annoying. Java’s Unicode property support is strictly antemillennial, by which I mean it supports no Unicode property that has come out in the last decade.

Not being able to talk about whitespace properly is super-annoying. Consider the following table. For each of those code points, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine:

             Regex    000A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

See that?

Virtually every one of those Java white space results is   ̲w̲r̲o̲n̲g̲  according to Unicode. It’s a really big problem. Java is just messed up, giving answers that are "wrong" according to existing practice and also according to Unicode. Plus Java doesn’t even give you access to the real Unicode properties! In fact, Java does not support any property that corresponds to Unicode whitespace.
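The Java column of that table can be spot-checked with a small probe. This is only a sketch: the class and helper names are invented here, it looks at three of the code points from the table, and the last row it prints uses the JDK7+ (?U) flag rather than anything available in older JDKs.

```java
import java.util.regex.Pattern;

// Hypothetical probe class for checking a few cells of the whitespace table.
class WhitespaceProbe {
    static final int[] CODE_POINTS = {0x0085, 0x00A0, 0x2029};
    static final String[] REGEXES = {
        "\\s", "\\p{javaWhitespace}", "\\p{javaSpaceChar}", "(?U)\\s"
    };

    /** True if the regex matches the single character with this code point. */
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        for (String regex : REGEXES) {
            StringBuilder row = new StringBuilder(String.format("%20s", regex));
            for (int cp : CODE_POINTS) {
                row.append(String.format("  U+%04X=%d", cp, matches(regex, cp) ? 1 : 0));
            }
            System.out.println(row);
        }
    }
}
```

The default \s rejects all three code points, while \p{javaWhitespace} and \p{javaSpaceChar} each accept a different subset, which is the inconsistency the table is pointing at.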

To deal with this and many other related problems, yesterday I wrote a Java function to rewrite a pattern string that rewrites these 14 charclass escapes:

\w \W \s \S \v \V \h \H \d \D \b \B \X \R

by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It’s only an alpha prototype from a single hack session, but it is completely functional.

The short story is that my code rewrites those 14 as follows:

\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]

\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

\d => \p{Nd}
\D => \P{Nd}

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

\X => (?>\PM\pM*)
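Two of these expansions can be tried directly as Java string literals. The sketch below copies the \w and \X replacements from the list above; the class and method names are hypothetical and are not taken from the answer's source code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of two expansions from the list above.
class ExpansionDemo {
    // Replacement for \w, as a Java string literal.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Replacement for \X (legacy grapheme cluster), as a Java string literal.
    static final String GRAPHEME = "(?>\\PM\\pM*)";

    /** True if the whole string consists of one or more word characters. */
    static boolean isWord(String s) {
        return Pattern.matches(WORD + "+", s);
    }

    /** Counts legacy grapheme clusters: a non-mark followed by any marks. */
    static int countGraphemes(String s) {
        Matcher m = Pattern.compile(GRAPHEME).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isWord("gefr\u00e4\u00dfig"));
        System.out.println(isWord("\u24B6"));                // circled capital A
        System.out.println(countGraphemes("e\u0301le\u0300ve")); // decomposed accents
    }
}
```

Counting with the \X replacement treats a base letter plus its combining marks as one unit, so a decomposed string has fewer graphemes than chars.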

Some things to consider...


  • That uses for its \X definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. See addendum at bottom.

  • What to do about \d depends on your intent, but the default is the Unicode definition. I can see people not always wanting \p{Nd}, but sometimes either [0-9] or \pN.

  • The two boundary definitions, \b and \B, are specifically written to use the \w definition.

  • That \w definition is overly broad, because it grabs the parenthesized letters, not just the circled ones. The Unicode Other_Alphabetic property isn’t available until JDK7, so that’s the best you can do.
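The \d trade-off mentioned in the list above is easy to see side by side. This sketch compares the three candidate definitions on single characters; the class and method names are hypothetical.

```java
// Hypothetical comparison of three possible meanings for \d.
class DigitChoices {
    /** ASCII digits only. */
    static boolean ascii(String s) {
        return s.matches("[0-9]");
    }

    /** Unicode decimal digits (general category Nd), the default expansion above. */
    static boolean decimal(String s) {
        return s.matches("\\p{Nd}");
    }

    /** Any Unicode number (Nd, Nl, or No). */
    static boolean anyNumber(String s) {
        return s.matches("\\pN");
    }

    public static void main(String[] args) {
        // ASCII 7, ARABIC-INDIC DIGIT THREE, ROMAN NUMERAL EIGHT, VULGAR FRACTION ONE HALF
        String[] samples = {"7", "\u0663", "\u2167", "\u00BD"};
        for (String s : samples) {
            System.out.printf("U+%04X ascii=%b decimal=%b anyNumber=%b%n",
                    s.codePointAt(0), ascii(s), decimal(s), anyNumber(s));
        }
    }
}
```

Which one is right depends on what happens to the match afterwards: only the ASCII and Nd cases are plain positional digits, and Integer.parseInt accepts those but not Roman numerals or fractions.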

Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them:


  1. They are only ever looking for \w word characters, never for non-word characters.
  2. They do not specifically look for the edge of the string.

A \b boundary means:

    IF does follow word
        THEN doesn't precede word
    ELSIF doesn't follow word
        THEN does precede word

And those are all defined perfectly straightforwardly as:


  • follows word is (?<=\w).
  • precedes word is (?=\w).
  • doesn’t follow word is (?<!\w).
  • doesn’t precede word is (?!\w).

Therefore, since IF-THEN is encoded as an and-ed-together AB in regexes, an or is X|Y, and because the and is higher in precedence than the or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:

    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

with the \w defined in the appropriate way.

(You might think it strange that the A and C components are opposites. In a perfect world, you should be able to write that AB|D, but for a while I was chasing down mutual exclusion contradictions in Unicode properties — which I think I’ve taken care of, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)

For the \B non-boundary, the logic is:

    IF does follow word
        THEN does precede word
    ELSIF doesn't follow word
        THEN doesn't precede word

Allowing all instances of \B to be replaced with:

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

This really is how \b and \B behave. Equivalent patterns for them are


  • \b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
  • \B using the ((IF)THEN|ELSE) construct is (?(?=\w)(?<=\w)|(?<!\w))

But the versions with just AB|CD are fine, especially if you lack conditional patterns in your regex language — like Java. ☹
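Here is a rough sketch of how the AB|CD forms can be assembled in Java from a word class. For brevity it assumes a simplified word class (the \w expansion without the enclosed-alphanumerics term), and the class, constant, and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that builds Unicode-aware \b and \B from lookarounds.
class UnicodeBoundary {
    // Simplified word class (assumption: enclosed alphanumerics omitted).
    static final String WORD = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    // (follows word AND doesn't precede word) OR (doesn't follow word AND precedes word)
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + "))";

    // (follows word AND precedes word) OR (doesn't follow word AND doesn't precede word)
    static final String NOT_BOUNDARY =
        "(?:(?<=" + WORD + ")(?=" + WORD + ")|(?<!" + WORD + ")(?!" + WORD + "))";

    /** Extracts maximal runs of word characters delimited by boundaries. */
    static List<String> words(String text) {
        Matcher m = Pattern.compile(BOUNDARY + WORD + "+" + BOUNDARY).matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("\u00e9l\u00e8ve, gefr\u00e4\u00dfig GO\u00c4_432!"));
    }
}
```

Because every lookaround tests a single character, the lookbehinds have a bounded length, which Java requires.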

I’ve already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I've run on a dozen different data configurations according to:

     0 ..     7F    the ASCII range
    80 ..     FF    the non-ASCII Latin1 range
   100 ..   FFFF    the non-Latin1 BMP (Basic Multilingual Plane) range
 10000 .. 10FFFF    the non-BMP portion of Unicode (the "astral" planes)

However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:


  • left edge as (?:(?<=^)|(?<=\s))
  • right edge as (?=$|\s)
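A small sketch of those two edges in use: it counts standalone occurrences of a token that contains non-word characters, where \b would not help. The class and method names are hypothetical, and \s here is Java's default ASCII-only \s unless it is rewritten as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of whitespace-and-edge-aware boundaries.
class TokenEdges {
    static final String LEFT_EDGE = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    /** Counts occurrences of token that are delimited by whitespace or string edges. */
    static int countToken(String token, String text) {
        // Pattern.quote keeps characters such as '+' from being read as regex syntax.
        Pattern p = Pattern.compile(LEFT_EDGE + Pattern.quote(token) + RIGHT_EDGE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countToken("C++", "C++ is not C++11 but C++"));
        System.out.println(countToken("-x", "run -x or y-x"));
    }
}
```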

The code I posted in my other answer provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.

It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It’s hard to overstress how important that is! And that’s just for the string expansion.
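To illustrate the difference between code points and UTF-16 surrogates, here is a sketch using only standard JDK calls. It does not reproduce the helper code from the other answer; the class and method names are invented, and the \x{...} regex escape assumes JDK 7 or later.

```java
import java.util.regex.Pattern;

// Hypothetical demo: working with a non-BMP character by code point.
class CodePointDemo {
    /** Builds a string from one logical code point (one or two UTF-16 chars). */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Length in code points rather than UTF-16 code units. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Matches s against a regex that names the code point via \x{...}. */
    static boolean matchesCodePoint(String s, int codePoint) {
        return Pattern.matches(String.format("\\x{%X}", codePoint), s);
    }

    public static void main(String[] args) {
        String fraktur = fromCodePoint(0x1D504); // MATHEMATICAL FRAKTUR CAPITAL A
        System.out.println(fraktur.length());           // UTF-16 code units
        System.out.println(codePointLength(fraktur));   // logical code points
        System.out.println(matchesCodePoint(fraktur, 0x1D504));
    }
}
```

Without such helpers, the same character has to be spelled as the surrogate pair \uD835\uDD04 in a Java string literal.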

For regex charclass substitution that makes the charclass in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I’d love to hear of it, but you don’t have to. It’s pretty short. The guts of the main regex rewriting function are simple:

switch (code_point) {

    case 'b':  newstr.append(boundary);
               break; /* switch */
    case 'B':  newstr.append(not_boundary);
               break; /* switch */

    case 'd':  newstr.append(digits_charclass);
               break; /* switch */
    case 'D':  newstr.append(not_digits_charclass);
               break; /* switch */

    case 'h':  newstr.append(horizontal_whitespace_charclass);
               break; /* switch */
    case 'H':  newstr.append(not_horizontal_whitespace_charclass);
               break; /* switch */

    case 'v':  newstr.append(vertical_whitespace_charclass);
               break; /* switch */
    case 'V':  newstr.append(not_vertical_whitespace_charclass);
               break; /* switch */

    case 'R':  newstr.append(linebreak);
               break; /* switch */

    case 's':  newstr.append(whitespace_charclass);
               break; /* switch */
    case 'S':  newstr.append(not_whitespace_charclass);
               break; /* switch */

    case 'w':  newstr.append(identifier_charclass);
               break; /* switch */
    case 'W':  newstr.append(not_identifier_charclass);
               break; /* switch */

    case 'X':  newstr.append(legacy_grapheme_cluster);
               break; /* switch */

    default:   newstr.append('\\');
               newstr.append(Character.toChars(code_point));
               break; /* switch */

}
saw_backslash = false;
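For readers who want something runnable around that fragment, here is a minimal sketch of the scanning loop it would sit in. It is not the author's implementation: it handles only \w, \W, \d, and \D, uses a simplified word class, ignores \Q...\E quoting, and every name in it is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical, cut-down rewriter illustrating the saw_backslash state machine.
class EscapeRewriter {
    // Simplified word class (assumption: enclosed alphanumerics omitted).
    static final String WORD = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";
    static final String NOT_WORD = "[^\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    /** Replaces \w, \W, \d, \D with Unicode-aware classes; copies everything else. */
    static String rewrite(String pattern) {
        StringBuilder out = new StringBuilder();
        boolean sawBackslash = false;
        for (int i = 0; i < pattern.length(); ) {
            int cp = pattern.codePointAt(i);
            i += Character.charCount(cp);
            if (!sawBackslash) {
                if (cp == '\\') {
                    sawBackslash = true;   // decide what to emit on the next code point
                } else {
                    out.appendCodePoint(cp);
                }
                continue;
            }
            switch (cp) {
                case 'w': out.append(WORD); break;
                case 'W': out.append(NOT_WORD); break;
                case 'd': out.append("\\p{Nd}"); break;
                case 'D': out.append("\\P{Nd}"); break;
                default:  // unknown escape (including \\): pass through unchanged
                    out.append('\\').appendCodePoint(cp);
                    break;
            }
            sawBackslash = false;
        }
        if (sawBackslash) {
            out.append('\\');              // keep a trailing lone backslash as-is
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rewritten = rewrite("\\w+\\s\\d+");
        System.out.println(rewritten);
        System.out.println(Pattern.matches(rewritten, "\u00e9l\u00e8ve \u0663\u0664"));
    }
}
```

Because an escaped backslash falls into the default branch and resets the flag, a literal \\w in the pattern is left alone rather than being mistaken for \w.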

Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won’t stay that way.

For the beta I intend to:


  • fold together the code duplication

  • provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes
  • provide some flexibility in the \d expansion, and maybe the \b
  • provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you

For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it’s not written as JUnit tests.

I have good news and bad news.

The good news is that I’ve now got a very close approximation to an extended grapheme cluster to use for an improved \X.

The bad news is that the pattern is:

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

which in Java you’d write as:

String extended_grapheme_cluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";

¡Tschüß!
