Java 8 UTF-8 encoding issue (java bug?)
Problem description
There is an inconsistency when creating a String with UTF-8 encoding.
Run this code:
public static void encodingIssue() throws IOException {
    // The three bytes are 0xED 0xBB 0x9C.
    byte[] array = new byte[3];
    array[0] = (byte) -19;   // 0xED
    array[1] = (byte) -69;   // 0xBB
    array[2] = (byte) -100;  // 0x9C
    String str = new String(array, "UTF-8");
    // Print each decoded char as its numeric value.
    for (char c : str.toCharArray()) {
        System.out.println((int) c);
    }
}
On Java 1.8.0_20 (and earlier Java 8 versions) we get the result (65533 is U+FFFD, the Unicode replacement character):
65533
On Java 1.7 and 1.6 we have the correct result:
57052
Have you encountered this error? Is there a workaround for this?
This inconsistency also shows up for Shift_JIS, JIS_X0212-1990, x-IBM300, x-IBM834, x-IBM942, x-IBM942C and x-JIS0208, but UTF-8 is obviously the most urgent.
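Storing the halves of a surrogate pair (or even unpaired chars of that range) as individual three-byte characters is a property of the "Modified UTF-8" encoding, not of standard UTF-8. The bytes in the question, 0xED 0xBB 0x9C, are such a sequence: read that way they yield the unpaired low surrogate U+DEDC (57052). It is an error if a decoder claiming to use standard UTF-8 actually applies "Modified UTF-8" rules. This seems to have been fixed with Java 8.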
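To make the distinction concrete, here is a minimal sketch. The class and method names (`SurrogateBytesDemo`, `rawThreeByteValue`, `isStrictUtf8`) are made up for illustration and are not part of the original question. It first applies the plain three-byte bit layout by hand, which is what a "Modified UTF-8" reader effectively does, and then asks a strict `CharsetDecoder` whether the same bytes are well-formed standard UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SurrogateBytesDemo {
    // The three bytes from the question: 0xED 0xBB 0x9C.
    static final byte[] ARRAY = { (byte) -19, (byte) -69, (byte) -100 };

    // Applies the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx without
    // checking whether the resulting value is a surrogate.
    static int rawThreeByteValue(byte[] b) {
        return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
    }

    // Returns true only if a decoder configured to report (not replace)
    // malformed input accepts the bytes as UTF-8.
    static boolean isStrictUtf8(byte[] b) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int value = rawThreeByteValue(ARRAY);
        System.out.println(value);                                  // 57052 (0xDEDC)
        System.out.println(Character.isLowSurrogate((char) value)); // true
        // Expected to be false on Java 8 and later, where the decoder
        // rejects encoded surrogates.
        System.out.println(isStrictUtf8(ARRAY));
    }
}
```

A decoder in `REPORT` mode is used here instead of `new String(array, "UTF-8")` because the `String` constructor silently substitutes U+FFFD, which hides whether the input was actually valid.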
You can reliably read such data using a method that is specified to use "Modified UTF-8":
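// "array" is the byte array from the question.
// readUTF() expects a two-byte length prefix in front of the encoded bytes.
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
String str = dis.readUTF();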
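The same workaround can be wrapped in a small self-contained helper. This is only a sketch: the class name `ModifiedUtf8Reader` and the method `decodeModifiedUtf8` are hypothetical, and the length check is an added assumption based on `readUTF()` using an unsigned 16-bit length prefix, so the input must not exceed 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class ModifiedUtf8Reader {
    // Hypothetical helper: decodes the given bytes as "Modified UTF-8"
    // by feeding them to DataInputStream.readUTF().
    static String decodeModifiedUtf8(byte[] data) {
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF() handles at most 65535 bytes");
        }
        // readUTF() reads an unsigned 16-bit big-endian length, then the payload.
        ByteBuffer bb = ByteBuffer.allocate(data.length + 2);
        bb.putShort((short) data.length).put(data);
        try (DataInputStream dis =
                     new DataInputStream(new ByteArrayInputStream(bb.array()))) {
            return dis.readUTF();
        } catch (IOException e) {
            // Includes UTFDataFormatException for bytes that are not valid Modified UTF-8.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] array = { (byte) -19, (byte) -69, (byte) -100 };
        String str = decodeModifiedUtf8(array);
        for (char c : str.toCharArray()) {
            System.out.println((int) c); // 57052
        }
    }
}
```

Wrapping the checked `IOException` in `UncheckedIOException` is a design choice for brevity; code that needs to distinguish malformed input would catch `UTFDataFormatException` explicitly instead.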