Why a surrogate Java regexp finds hyphen-minus (-)


Question

I am trying to find out why this Java regex, ([\ud800-\udbff\udc00-\udfff]), used in replaceAll(regexp, ""), also removes the hyphen-minus character along with the surrogate characters.

Its Unicode code point is \u002d, so it does not seem to fall inside any of those ranges.

I could easily suppress this behaviour by adding &&[^\u002d], resulting in ([\ud800-\udbff\udc00-\udfff&&[^\u002d]])

But since I do not know why \u002d is removed, I suspect other characters may be getting removed without my noticing.

Example:

String text = "A\u002dB";
System.out.println(text);
String regex = "([\ud800-\udbff\udc00-\udfff])";
System.out.println(text.replaceAll(regex, "X"));

prints:

A-B
AXB

Answer

Overview and assumption

Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

Behind the scenes

Assuming that you are running your regex on Oracle's implementation, your regex

"([\ud800-\udbff\udc00-\udfff])"

is compiled as follows:

StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
  Pattern.union. A ∪ B:
    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
    BitClass. Match any of these 1 character(s):
      [U+002D]
  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match

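The character class is parsed as three items: the range \ud800-\udbff\udc00, a literal -, and the single character \udfff. Because \udbff\udc00 forms a valid surrogate pair, it represents the single code point U+10FC00, which becomes the upper end of the first range. The - that follows no longer sits between two range endpoints, so it is taken as a literal hyphen-minus, which is why it gets replaced.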
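The parse described above can be made visible by listing the code points of the pattern string. The following is a minimal sketch, not part of the original answer; the class name PatternCodePoints and the helper codePoints are made up for illustration, and String.codePoints() requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

class PatternCodePoints {
    // Hypothetical helper: lists a string's code points, combining valid surrogate pairs.
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        // javac turns the Unicode escapes into UTF-16 code units before the regex engine runs,
        // so the engine receives raw surrogate characters, not escape sequences.
        String regex = "([\ud800-\udbff\udc00-\udfff])";

        // Prints: [U+0028, U+005B, U+D800, U+002D, U+10FC00, U+002D, U+DFFF, U+005D, U+0029]
        System.out.println(codePoints(regex));

        // The high surrogate DBFF and the low surrogate DC00 combine into one code point.
        // Prints: U+10FC00
        System.out.println(String.format("U+%X", Character.toCodePoint('\udbff', '\udc00')));
    }
}
```

Read this way, the class contains the range U+D800 to U+10FC00, then a hyphen-minus with no range start before it, then U+DFFF, which agrees with the compiled structure shown above.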

Wrong solution

There is no point in writing:

"[\ud800-\udbff][\udc00-\udfff]"

Oracle's implementation matches by code point, and a valid surrogate pair in the input is converted to a single code point before matching. The regex above therefore can't match anything: it looks for two consecutive lone surrogates that would form a valid pair, and such a sequence is always read as one code point instead.
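A quick way to check this claim on your own runtime is sketched below. It is an illustration only: the class name PairOfLoneSurrogates and the helper finds are hypothetical, and the expected results assume an Oracle/OpenJDK implementation as discussed in this answer.

```java
import java.util.regex.Pattern;

class PairOfLoneSurrogates {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String wrong = "[\ud800-\udbff][\udc00-\udfff]";

        // A valid surrogate pair (U+1F600) is read as one code point,
        // so neither character class ever sees its two halves separately.
        String validPair = "\ud83d\ude00";
        System.out.println(finds(wrong, validPair)); // expected: false

        // A lone high surrogate, by contrast, is matched by a single class.
        String loneHigh = "\ud800" + "x";
        System.out.println(finds("[\ud800-\udbff]", loneHigh)); // expected: true
    }
}
```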

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");

This solution has been tested to work in Java 6 and 7 (Oracle implementation).
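As a small usage sketch (the class name StripAstralAndSurrogates, the method strip and the sample input are made up for illustration), the pattern removes an astral character and a lone surrogate while leaving the hyphen-minus alone:

```java
class StripAstralAndSurrogates {
    // Range U+10000..U+10FFFF (written as surrogate pairs) plus the lone surrogates U+D800..U+DFFF.
    static final String REGEX = "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";

    // Hypothetical helper wrapping the replaceAll call from the answer.
    static String strip(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        // "A-B", then U+1F600 as a valid pair, then "C", a lone high surrogate, then "D".
        String input = "A-B" + "\ud83d\ude00" + "C" + "\ud800" + "D";
        System.out.println(strip(input)); // prints A-BCD
    }
}
```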

The regex above compiles to:

StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match

Note that I am specifying the characters with string literal Unicode escape sequences, not with the escape sequences of the regex syntax. The variant below uses regex escapes instead:

// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");

Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so the regex treats \\ud800 as one character and tries to compile the range \\udc00-\\udbff, where it fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.


From Java 7 onward, the syntax \x{h..h} is available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in astral planes.

input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");

This regex also compiles to the same structure as above.
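To compare the three spellings discussed in this answer side by side, something like the following can be run on Java 7 or later. It is only a sketch: the class name EquivalentSpellings, the constant names and the sample input are hypothetical.

```java
class EquivalentSpellings {
    // String-literal Unicode escapes: javac produces the surrogate characters.
    static final String LITERAL = "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";
    // Regex-syntax escapes: the regex engine itself pairs up the surrogates (Java 7+).
    static final String REGEX_ESCAPES = "[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]";
    // Hex code point notation for the astral range (Java 7+).
    static final String HEX = "[\\x{10000}-\\x{10ffff}\ud800-\udfff]";

    // Hypothetical helper: removes everything the given character class matches.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // "A-B", then U+1F600 as a valid pair, then "C", a lone high surrogate, then "D".
        String input = "A-B" + "\ud83d\ude00" + "C" + "\ud800" + "D";
        // Each line is expected to print A-BCD.
        System.out.println(strip(LITERAL, input));
        System.out.println(strip(REGEX_ESCAPES, input));
        System.out.println(strip(HEX, input));
    }
}
```

The \x{h..h} form keeps the astral range readable without any surrogate arithmetic, which is presumably why it is the recommended spelling.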
