Java Replace Unicode Characters in a String


Problem description

I have a string that contains multiple Unicode characters. I want to identify all of these Unicode characters, e.g. \uF06C, and replace each one with a backslash followed by the four hex digits, without the "u".

Example:

Source string: "add \uF06Cd1 clause"

Result string: "add \F06Cd1 clause"

How can I achieve this in Java?

The linked question, "Java Regex - How to replace a pattern or how to", is different from this one because my question deals with Unicode characters. Although the escape is written with several literal characters, the JVM treats it as one single character, and hence a regex won't work.

Recommended answer

The correct way to do this is to use a regex that matches the entire Unicode escape and then use group replacement.

The regex to match the Unicode escape:

A Unicode escape looks like \uABCD, that is, \u followed by a four-digit hexadecimal number. Matching these can be done using

\\u[A-Fa-f\d]{4}

But there's a problem with this:
In a String like "just some \\uabcd arbitrary text" the \u would still get matched, even though the first backslash escapes the second one. So we need to make sure the \u is preceded by an even number (possibly zero) of backslashes:

(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
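To see what the lookbehind and the optional pairs of backslashes achieve, the pattern can be tried against inputs with one, two, and three backslashes in front of the u. This is only a sketch: the class name `EscapeMatchDemo` and the helper `containsEscape` are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
final class EscapeMatchDemo {
    // The regex from the text above, written as a Java string literal.
    static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)*\\\\u[A-Fa-f\\d]{4}");

    // True if the text contains an escape whose backslash is itself unescaped.
    static boolean containsEscape(String text) {
        return ESCAPE.matcher(text).find();
    }

    public static void main(String[] args) {
        // One backslash before the u: a real escape sequence.
        System.out.println(containsEscape("just some \\uabcd arbitrary text"));     // true
        // Two backslashes: an escaped backslash followed by plain "uabcd".
        System.out.println(containsEscape("just some \\\\uabcd arbitrary text"));   // false
        // Three backslashes: an escaped backslash followed by a real escape.
        System.out.println(containsEscape("just some \\\\\\uabcd arbitrary text")); // true
    }
}
```

The Java literals above double every backslash, so the three test inputs contain one, two, and three backslashes at run time.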

Now as output, we want a backslash followed by the hex part. This can be done by group replacement, so let's start by grouping the characters:

(?<!\\)(\\\\)*(\\u)([A-Fa-f\d]{4})

As a replacement we want all backslashes from the group that matches pairs of backslashes, followed by a single backslash and the hex part of the Unicode escape:

$1\\$3

Now for the actual code:

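import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test holds the escape as plain text: backslash, 'u', then F06C
String test = "add \\uF06Cd1 clause";
String pattern = "(?<!\\\\)(\\\\\\\\)*(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";

Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace); // add \F06Cd1 clause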

That's a lot of backslashes! The reason is that a backslash has to be escaped twice: once for the Java string literal and once for the regex. So "\\\\" as a pattern string in Java matches one \ in the input.
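The two levels of escaping can be checked directly. The following is a minimal sketch; the class name `BackslashLevelsDemo` and the helper `isSingleBackslash` are assumptions introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class BackslashLevelsDemo {
    // Four backslashes in Java source become two in the string,
    // which the regex engine reads as one literal backslash.
    static final String ONE_BACKSLASH_REGEX = "\\\\";

    // True if the whole input is exactly one backslash.
    static boolean isSingleBackslash(String input) {
        return Pattern.matches(ONE_BACKSLASH_REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(ONE_BACKSLASH_REGEX.length()); // 2
        System.out.println(isSingleBackslash("\\"));      // true
        System.out.println(isSingleBackslash("\\\\"));    // false
    }
}
```

The same doubling applies to the replacement string, where a backslash is also an escape character, which is why a single output backslash is written as four backslashes in the Java source.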


On actual strings, where the character itself is present rather than its escape sequence, the characters need to be filtered out and replaced by their numeric (hexadecimal) representation:

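// in is the input String; every char above 127 is treated as non-ASCII
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        // backslash followed by the four-digit hex value (lowercase with %04x)
        sb.append("\\").append(String.format("%04x", (int) c));
    } else {
        sb.append(c);
    }
}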

This assumes that by "unicode-character" you mean non-ASCII characters. The code emits any ASCII character as is and outputs every other character as a backslash followed by its Unicode code in hex. The term "unicode-character" is rather vague though, as a char in Java always represents a Unicode character. This approach preserves control characters like "\n", "\r", etc., which is why I chose it over other definitions.
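Applied to the example from the question, the loop can be wrapped in a method as sketched below. The class name `NonAsciiEscaper` and the method `escapeNonAscii` are hypothetical, and the switch to `%04X` (uppercase hex) is an assumption made here to reproduce the uppercase F06C shown in the question.

```java
// Hypothetical helper class; names are illustrative only.
final class NonAsciiEscaper {
    // Replaces every char above 127 with a backslash and its four-digit hex value.
    static String escapeNonAscii(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c > 127) {
                // %04X gives uppercase hex; an assumption to match the question's example
                sb.append('\\').append(String.format("%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+F06C is built from its numeric value so the source file stays pure ASCII
        String source = "add " + (char) 0xF06C + "d1 clause";
        System.out.println(escapeNonAscii(source)); // add \F06Cd1 clause
        // ASCII control characters such as the newline pass through unchanged
        System.out.println(escapeNonAscii("line1\nline2").equals("line1\nline2")); // true
    }
}
```

Because the loop works on individual char values, a character outside the Basic Multilingual Plane would come out as two escaped surrogates rather than one code point.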
