Matching (e.g.) a Unicode letter with Java regexps


Question

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".

The Java documentation defines POSIX classes for things like alphabetic characters, but they are specified to work only with US-ASCII. The predefined character classes define a word character as [a-zA-Z_0-9], which also excludes many letters.
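
To make the limitation concrete, here is a minimal sketch. The class name and the sample strings are made up for illustration. It checks a Greek word against the POSIX class, against \w, and, for contrast, against the Unicode category \p{L} and the UNICODE_CHARACTER_CLASS flag, both of which java.util.regex.Pattern accepts (the flag from Java 7 on).

```java
import java.util.regex.Pattern;

class UnicodeLetterDemo {
    // By default these two classes cover ASCII only.
    static final Pattern POSIX_ALPHA = Pattern.compile("\\p{Alpha}+");
    static final Pattern WORD = Pattern.compile("\\w+");
    // Unicode general category "Letter", independent of block or script.
    static final Pattern UNICODE_LETTER = Pattern.compile("\\p{L}+");
    // The same word class, switched to Unicode semantics by a flag (Java 7+).
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // Greek small letters alpha, beta, gamma

        System.out.println(matches(POSIX_ALPHA, "abc"));    // true
        System.out.println(matches(POSIX_ALPHA, greek));    // false
        System.out.println(matches(WORD, greek));           // false
        System.out.println(matches(UNICODE_LETTER, greek)); // true
        System.out.println(matches(UNICODE_WORD, greek));   // true
    }
}
```

Note that \p{L} matches letters only, so a string such as "abc1" is rejected by UNICODE_LETTER while both word patterns accept it.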

So how do you properly match against Unicode strings? Is there some other library that gets this right?

Recommended answer

Here is a very nice explanation:

http://www.regular-expressions.info/unicode.html

Some hints:

"Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+."

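The \P{M}\p{M}* substitute from the quote can be tried out with a short sketch like the one below. The class and method names are hypothetical, and the sample text is my own. It splits a string into "base character plus any following combining marks" units. As far as I know, Java 9 and later also accept \X directly, but the sketch sticks to the substitute so that it runs on older versions too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphemeDemo {
    // One non-mark code point followed by zero or more combining marks.
    static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");
    // Any number of such units, the suggested replacement for a repeated grapheme token.
    static final Pattern GRAPHEMES = Pattern.compile("(?:\\P{M}\\p{M}*)+");

    static List<String> graphemes(String text) {
        List<String> units = new ArrayList<>();
        Matcher matcher = GRAPHEME.matcher(text);
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }

    public static void main(String[] args) {
        // 'e' + combining acute, 'a' + combining grave, then a plain 'z'.
        String text = "e\u0301a\u0300z";

        System.out.println(text.length());                      // 5
        System.out.println(graphemes(text).size());             // 3
        System.out.println(GRAPHEMES.matcher(text).matches());  // true
    }
}
```

This is only an approximation of a grapheme: it treats every non-mark code point as the start of a new unit, which is what the quoted substitute does.
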
"In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant."

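The difference between the two string literals is easy to see in a small sketch (class and member names are invented for the example). The first literal is resolved by the Java compiler, so the regex engine receives the single character à. The second reaches the regex engine as six characters and is resolved there. The expected outputs shown are for the default case without Pattern.CANON_EQ. The last two lines turn the flag on and carry no expected output, because I am not certain every JDK version behaves exactly as the quote describes.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Resolved by the compiler: the pattern string is one character, U+00E0.
    static final String COMPILER_ESCAPE = "\u00E0";
    // The doubled backslash survives compilation: the pattern string is
    // six characters, and the regex engine resolves the escape itself.
    static final String REGEX_ESCAPE = "\\u00E0";

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E0";    // single code point
        String decomposed = "a\u0300"; // 'a' followed by a combining grave accent

        System.out.println(COMPILER_ESCAPE.length()); // 1
        System.out.println(REGEX_ESCAPE.length());    // 6

        // Without canonical equivalence, both patterns match only the composed form.
        System.out.println(matches(COMPILER_ESCAPE, 0, composed));   // true
        System.out.println(matches(REGEX_ESCAPE, 0, composed));      // true
        System.out.println(matches(COMPILER_ESCAPE, 0, decomposed)); // false
        System.out.println(matches(REGEX_ESCAPE, 0, decomposed));    // false

        // With canonical equivalence: run this to see what your JDK does.
        System.out.println(matches(COMPILER_ESCAPE, Pattern.CANON_EQ, decomposed));
        System.out.println(matches(REGEX_ESCAPE, Pattern.CANON_EQ, decomposed));
    }
}
```
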