Java regex support for non-ASCII values?


Question

We currently have a method that strips out characters that are not alphabetic or whitespace. It is simply:

String clean(String input)
{
   return input==null?"":input.replaceAll("[^a-zA-Z ]","");
}

This really ought to be fixed to support non-English characters (e.g. ś, ũ, ...). Unfortunately, the Java regex classes (e.g. "\W", a non-word character, and "\p{Alpha}", US-ASCII only) don't seem to support this. Is there a way of doing this with a Java regex rather than manually looping through each character to test it?

Recommended answer

The Java 6 Pattern class handles Unicode. See the java.util.regex.Pattern documentation, quoted below.


Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
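
The equivalence described in that paragraph can be checked with a small sketch. The class and method names here (UnicodeEscapeDemo, matchesU2014) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Hypothetical helper: true if the regex matches the single character U+2014.
    static boolean matchesU2014(String regex) {
        return Pattern.matches(regex, String.valueOf((char) 0x2014));
    }

    public static void main(String[] args) {
        // The compiler resolves this escape, so the string holds one character.
        String compilerEscape = "\u2014";
        // The doubled backslash leaves six characters for the regex parser to interpret.
        String regexEscape = "\\u2014";

        System.out.println(compilerEscape.length());             // 1
        System.out.println(regexEscape.length());                // 6
        System.out.println(compilerEscape.equals(regexEscape));  // false
        System.out.println(matchesU2014(compilerEscape));        // true
        System.out.println(matchesU2014(regexEscape));           // true
    }
}
```

The second form is the one that still works when the expression text comes from a file or user input, since no Java compiler is involved there.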

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
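
Applied to the method from the question, this suggests replacing the a-zA-Z range with the \p{L} category. The following is a minimal sketch rather than a drop-in fix: the class name UnicodeClean is made up for illustration, and it assumes that "alphabetic" should mean any Unicode letter.

```java
import java.util.regex.Pattern;

class UnicodeClean {
    // \p{L} matches any Unicode letter; the literal space is kept, as in the original method.
    // Compiling once avoids recompiling the pattern on every call.
    private static final Pattern NOT_LETTER_OR_SPACE = Pattern.compile("[^\\p{L} ]");

    static String clean(String input) {
        return input == null ? "" : NOT_LETTER_OR_SPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on source file encoding:
        // \u015b is the s with acute, \u0169 is the u with tilde from the question.
        String cleaned = clean("\u015b\u0169 abc 123!");
        System.out.println(cleaned.equals("\u015b\u0169 abc ")); // true
    }
}
```

One caveat: \p{L} covers letters only, so if the input is in decomposed form (a base letter followed by a separate combining mark), the mark falls in category M and would be stripped. Adding \p{M} to the class, or normalizing the input first, may be needed in that case.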
