UTF-8 decoding in Java
Problem description
I'm trying to pass parameters from a PHP middle tier to a Java backend that understands J2EE. I'm writing the controller code in Groovy. In there, I'm trying to decode some parameter that will likely contain international characters.
I am really puzzled by what my debugging has turned up so far, so I want to share the results in the hope that someone can interpret them correctly.
For the sake of my little test, the parameter I'm passing is "déjeuner". Just to be sure, System.out.println("déjeuner") correctly gives me:
déjeuner
in the console
The following are the char, decimal, and hex values of each char of the original string:
next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
Note that the c3a9 sequence is the UTF-8 encoding of the character I want, é (U+00E9): http://www.fileformat.info/info/unicode/char/00e9/index.htm
Now if I try to read this string as a UTF-8 string, as in stmt.getBytes("UTF-8"), I suddenly end up with an 11-byte sequence, as follows:
64 c3 83 c2 a9 6a 65 75 6e 65 72
whereas stmt.getBytes("iso-8859-1") gives me 9 bytes:
64 c3 a9 6a 65 75 6e 65 72
Note the c3a9 sequence here!
Now if I try to convert the UTF-8 sequence to UTF-8, as in
new String(stmt.getBytes("UTF-8"), "UTF-8");
I get:
next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
Note the c3a9 sequence,
while
new String(stmt.getBytes("iso-8859-1"), "UTF-8")
results in:
next char: d 100 64
next char: ? -23 e9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72
Note the e9, which in UTF-8 (and ASCII) is, again, the 'é' character that I'm longing for.
Unfortunately, in neither case do I end up with a proper string that displays like the literal string "déjeuner". Strangely enough, both byte sequences seem correct.
Solution
When dealing with Strings, always remember: byte != char. So in your first example, you have the char c3, not the byte c3, which is a huge difference: the byte would be part of the UTF-8 sequence, but the char already is Unicode. So when you convert that to UTF-8, the Unicode character c3 must become the byte sequence c3 83.
So the question is: how did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.
The reason why ISO-8859-1 usually works is that this encoding doesn't modify any char with a code point < 256 (i.e. anything between 0 and 255), so UTF-8 encoded byte sequences won't be modified.
Your last example is also wrong: the char e9 is é in ISO-8859-1 and Unicode. In UTF-8, it's not valid, since it's not a byte and since the byte c3 prefix is missing (é is encoded in UTF-8 as the two bytes c3 a9). That said, it correctly represents the Unicode string you seek.
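The byte-versus-char distinction can be reproduced in a few lines of Java. The sketch below assumes that the String was created upstream by decoding UTF-8 bytes as ISO-8859-1. That is a guess at the bug, not something the question confirms. The class MojibakeDemo and its helpers hex, misdecode and repair are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the original post.
class MojibakeDemo {

    // Formats bytes as space-separated lowercase hex, e.g. "64 c3 a9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Simulates the assumed upstream bug: UTF-8 bytes decoded as ISO-8859-1,
    // so every byte becomes a char of its own (byte c3 -> char U+00C3).
    static String misdecode(String correct) {
        byte[] utf8 = correct.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Workaround: ISO-8859-1 maps chars 0..255 back to the same byte values,
    // which recovers the original UTF-8 bytes; then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String stmt = misdecode("d\u00e9jeuner"); // 9 chars instead of 8

        // Encoding the chars U+00C3 and U+00A9 as UTF-8 yields two bytes each.
        System.out.println(hex(stmt.getBytes(StandardCharsets.UTF_8)));
        // prints "64 c3 83 c2 a9 6a 65 75 6e 65 72"

        // ISO-8859-1 leaves the byte values untouched.
        System.out.println(hex(stmt.getBytes(StandardCharsets.ISO_8859_1)));
        // prints "64 c3 a9 6a 65 75 6e 65 72"

        String fixed = repair(stmt);
        System.out.println(fixed.equals("d\u00e9jeuner")); // prints "true"
        System.out.println(fixed.length());                // prints "8"
    }
}
```

The repair step is the same round trip as the last example in the question, and it is only a workaround. The cleaner fix is to decode the incoming bytes as UTF-8 at the point where the String is first created.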