Java 8 UTF-8 encoding issue (Java bug?)

Problem description

There is an inconsistency when creating a String with UTF-8 encoding.

Run this code:

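public static void encodingIssue() throws IOException {
    // 0xED 0xBB 0x9C: a three-byte sequence whose payload bits form U+DEDC, a low surrogate
    byte[] array = new byte[3];
    array[0] = (byte) -19;  // 0xED
    array[1] = (byte) -69;  // 0xBB
    array[2] = (byte) -100; // 0x9C

    String str = new String(array, "UTF-8");
    // Print each decoded UTF-16 code unit as a decimal number
    for (char c : str.toCharArray()) {
        System.out.println((int) c);
    }
}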

On Java 1.8.0_20 (and earlier Java 8 builds) we get the result:

 65533

On Java 1.7 and 1.6 we get the correct result:

 57052

Have you encountered this error? Is there a workaround for it?

The same inconsistency also appears with Shift_JIS, JIS_X0212-1990, x-IBM300, x-IBM834, x-IBM942, x-IBM942C, and x-JIS0208, but UTF-8 is obviously the more urgent case.

Recommended answer

Storing surrogate pairs (or even unpaired chars in that range) as individual characters is a property of the "Modified UTF-8" encoding. A decoder that claims to implement standard UTF-8 but accepts such "Modified UTF-8" sequences is in error. This seems to have been fixed in Java 8. The bytes in the question (0xED 0xBB 0x9C) carry the payload bits of U+DEDC, a lone low surrogate, which standard UTF-8 does not allow, so Java 8 substitutes the replacement character U+FFFD (65533).
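To make the point above concrete, here is a minimal sketch that combines the payload bits of the three bytes by hand and then asks a strict UTF-8 decoder whether it accepts them. The class name `SurrogateBytesDemo` and the helpers `decodeThreeBytes` and `isStrictUtf8` are hypothetical names introduced for illustration; they are not part of the original question or answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SurrogateBytesDemo {

    /** Combines the payload bits of a three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx. */
    static int decodeThreeBytes(byte b1, byte b2, byte b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    /** Returns true if a UTF-8 decoder configured to report errors accepts the bytes. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] array = {(byte) -19, (byte) -69, (byte) -100}; // 0xED 0xBB 0x9C
        int codeUnit = decodeThreeBytes(array[0], array[1], array[2]);
        System.out.println(codeUnit);                               // 57052 (0xDEDC)
        System.out.println(Character.isSurrogate((char) codeUnit)); // true
        // Expected to be false on Java 8 and later, consistent with the behaviour described above
        System.out.println(isStrictUtf8(array));
    }
}
```

Configuring the decoder with `CodingErrorAction.REPORT` turns the silent substitution of U+FFFD into an exception, which can be useful when you would rather detect such input than receive a replacement character.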

You can reliably read such data using a method that is specified to use "Modified UTF-8":

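// Prepend the two-byte big-endian length prefix that readUTF() expects
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
// readUTF() is specified to decode Modified UTF-8; it throws IOException on bad input
String str = dis.readUTF();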
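The snippet above can be wrapped into a self-contained program along these lines. This is only a sketch: the class name `ModifiedUtf8Reader` and the helper `decodeModifiedUtf8` are hypothetical, and the length check reflects the assumption that the data fits in the unsigned 16-bit length prefix that `readUTF()` reads (at most 65535 bytes).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

final class ModifiedUtf8Reader {

    /** Hypothetical helper: decodes bytes as Modified UTF-8 via DataInputStream.readUTF(). */
    static String decodeModifiedUtf8(byte[] data) {
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF() supports at most 65535 bytes");
        }
        // Frame the data with the two-byte big-endian length prefix readUTF() expects
        ByteBuffer framed = ByteBuffer.allocate(data.length + 2);
        framed.putShort((short) data.length).put(data);
        try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(framed.array()))) {
            return in.readUTF();
        } catch (IOException e) {
            // Malformed input surfaces as UTFDataFormatException, an IOException subclass
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] array = {(byte) -19, (byte) -69, (byte) -100}; // bytes from the question
        for (char c : decodeModifiedUtf8(array).toCharArray()) {
            System.out.println((int) c); // 57052
        }
    }
}
```

Wrapping the checked `IOException` in `UncheckedIOException` is a design choice made here for brevity; callers that need to distinguish malformed input may prefer to let the exception propagate.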
