UTF Encoding for Chinese Characters (Java)
Problem description
I am receiving a String via an object from an Axis web service. Because I'm not getting the string I expected, I converted the string into bytes to check, and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hex, when I'm expecting E4BDA0 E5A5BD E59097, which is 你好吗 in UTF-8.
Any ideas what might be causing 你好吗 to become C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297? I did a Google search, but all I got was a Chinese website describing a problem that happens in Python. Any insights will be great, thanks!
You have what is known as a double encoding.
You have the three-character sequence "你好吗", which you correctly point out is encoded in UTF-8 as E4BDA0 E5A5BD E59097.
But now take each byte of THAT encoding, treat it as a character of its own, and encode it in UTF-8 again. Start with E4. Read as ISO-8859-1, that byte is the code point U+00E4 (ä). What is that code point in UTF-8? Try it! It's C3 A4!
You get the idea.... :-)
Here is a Java app which illustrates this:
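import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        // First encoding: the correct UTF-8 bytes (e4 bd a0 e5 a5 bd e5 90 97).
        byte[] encoding1 = "你好吗".getBytes(StandardCharsets.UTF_8);
        // The mistake: decode those UTF-8 bytes as ISO-8859-1, one char per byte.
        String string1 = new String(encoding1, StandardCharsets.ISO_8859_1);
        for (byte b : encoding1) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
        // Second encoding: each mis-decoded char becomes two UTF-8 bytes.
        byte[] encoding2 = string1.getBytes(StandardCharsets.UTF_8);
        for (byte b : encoding2) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}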
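If the garbled string cannot be fixed at its source, the double encoding can usually be undone by running the mistaken step backwards. The following is only a sketch, assuming the wrong intermediate decoding really was ISO-8859-1 (the C2 90 and C2 97 bytes in the question suggest so) and that no bytes were lost along the way. The class name DoubleEncodingRepair and the method repair are made up for illustration and are not part of Axis or any library.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingRepair {

    // Hypothetical helper: reverses one round of "UTF-8 bytes decoded as ISO-8859-1".
    static String repair(String garbled) {
        // Each char of the garbled string stands for one original byte,
        // so encoding it as ISO-8859-1 recovers the original UTF-8 bytes.
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were written in.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4f60\u597d\u5417" is 你好吗, written with escapes so the result
        // does not depend on the source file's encoding.
        String expected = "\u4f60\u597d\u5417";
        // Reproduce the problem: UTF-8 bytes mis-decoded as ISO-8859-1.
        String garbled = new String(
                expected.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(expected));
    }
}
```

This only treats the symptom. The better fix is to find the place where the UTF-8 bytes are being decoded with the wrong charset and correct it there.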