What happens under the hood when bytes are converted to a String in Java?


Problem description

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

and the original bytes are not the same as the transferred bytes, which are respectively

[1, 2, -3]
[1, 2, -17, -65, -67]

At first I thought this was due to the UTF-8 charset mapping for the negative value -3, so I changed it to -32. But the transferred array remained the same!

[1, 2, -32]
[1, 2, -17, -65, -67] 

So I would really like to know exactly what happens when I call new String(bytes) :)

Solution

Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point: the leading bits of the first byte indicate how many other bytes follow for the same code point, and each of those following bytes has the form 10xxxxxx.

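Refer to this table of UTF-8 byte patterns (the 5-byte and 6-byte forms belong to the original design and are no longer legal):

Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
1      0xxxxxxx
2      110xxxxx  10xxxxxx
3      1110xxxx  10xxxxxx  10xxxxxx
4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx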

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.
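The bit-pattern reasoning above can be sketched in a few lines of Java. This is only an illustration: the class name Utf8LeadByte and the method announcedLength are made up for this example, and the method looks at nothing but the leading bits of a single byte, so it is not a full validator.

```java
final class Utf8LeadByte {
    /**
     * Returns the total sequence length that a lead byte announces under the
     * original UTF-8 design (1 to 6), or -1 for a continuation byte (10xxxxxx)
     * and for 0xFE/0xFF, which match none of the lead-byte patterns.
     * Only the leading bits are inspected; this is not a validity check.
     */
    static int announcedLength(byte b) {
        int u = b & 0xFF;                  // view the signed Java byte as 0..255
        if ((u & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((u & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4;  // 11110xxx
        if ((u & 0xFC) == 0xF8) return 5;  // 111110xx (no longer legal)
        if ((u & 0xFE) == 0xFC) return 6;  // 1111110x (no longer legal)
        return -1;
    }

    public static void main(String[] args) {
        // The bytes from the question, including the second attempt with -32.
        for (byte b : new byte[] {1, 2, -3, -32}) {
            System.out.printf("%4d = 0x%02X -> %d%n", b, b & 0xFF, announcedLength(b));
        }
    }
}
```

For the bytes in the question, -3 announces a 6-byte sequence and -32 a 3-byte sequence, yet in both arrays it is the last byte, so the announced continuation bytes never arrive.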

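Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), represented in Java as -17, -65, -67. The same thing happens with -32 (hex 0xE0, binary 11100000): it is the start byte of a 3-byte sequence, but no continuation bytes follow, so it is replaced as well.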
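To see the replacement happen, here is a minimal sketch. It uses the JDK's java.nio.charset.StandardCharsets instead of the Charsets class from the question (which looks like Guava's, though that is an assumption), and the class and method names (Utf8RoundTrip, roundTrip, isValidUtf8) are invented for this example. It also shows one way to get an exception instead of a silent replacement, by asking a CharsetDecoder to report malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
    /** Decodes leniently (malformed input becomes U+FFFD), then re-encodes. */
    static byte[] roundTrip(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true only if a strict decoder accepts the bytes as UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;   // the decoder reported malformed input
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(s.charAt(2) == '\uFFFD');            // true
        System.out.println(Arrays.toString(roundTrip(bytes)));  // [1, 2, -17, -65, -67]
        System.out.println(isValidUtf8(bytes));                 // false
    }
}
```

If the goal is to carry arbitrary bytes through a String unchanged, UTF-8 is the wrong tool, since the lenient decoder discards the original invalid byte and there is no way to recover it from the resulting String.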
