What happens under the hood when bytes are converted to String in Java?

Views: 135 | Published: 2018/12/20 21:53:46 | Tags: java, string, unicode, utf-8, byte


Problem description

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

and the original bytes are not the same as the transferred bytes, which are respectively

[1, 2, -3]
[1, 2, -17, -65, -67]

At first I thought this was due to the UTF-8 charset mapping for the negative value -3, so I changed it to -32. But the transferred array remains the same!

[1, 2, -32]
[1, 2, -17, -65, -67]

So I would really like to know exactly what happens when I call new String(bytes) :)

Solution

Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point. The form of the first byte of a sequence indicates how many continuation bytes follow for the same code point, and every continuation byte has the form 10xxxxxx.
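As a rough illustration of that idea, the sketch below reads the leading bits of a byte to tell what role it announces. The class and method names are made up for this example, and the check looks only at bit patterns: it does not reject lead bytes such as 0xC0 or 0xF5, which a real decoder also refuses.

```java
class Utf8LeadByte {
    /**
     * Returns the sequence length announced by a lead byte (1 to 4),
     * 0 for a continuation byte (10xxxxxx), or -1 for a bit pattern
     * that current UTF-8 never uses as a lead byte (11111xxx).
     */
    static int announcedLength(byte b) {
        int u = b & 0xFF;                      // view the signed Java byte as 0..255
        if ((u & 0x80) == 0x00) return 1;      // 0xxxxxxx: stands alone
        if ((u & 0xC0) == 0x80) return 0;      // 10xxxxxx: continuation byte
        if ((u & 0xE0) == 0xC0) return 2;      // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3;      // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4;      // 11110xxx
        return -1;                             // 11111xxx: not legal today
    }

    public static void main(String[] args) {
        // The bytes from the question, including the -32 variant.
        for (byte b : new byte[] {1, 2, -3, -32}) {
            System.out.printf("%4d (0x%02X) -> %d%n", b, b & 0xFF, announcedLength(b));
        }
    }
}
```

Masking with 0xFF matters because Java bytes are signed; without it, -3 would not compare cleanly against the hex patterns.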

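Refer to this table of byte patterns from the original UTF-8 design (sequences longer than 4 bytes are no longer legal):

  Bytes  Pattern
  1      0xxxxxxx
  2      110xxxxx 10xxxxxx
  3      1110xxxx 10xxxxxx 10xxxxxx
  4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  5      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  6      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx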

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.
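One way to confirm that such an array is malformed, instead of having it silently repaired, is to decode it strictly. The sketch below uses the JDK's CharsetDecoder with CodingErrorAction.REPORT, which makes the decoder throw on bad input; the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    /** Returns true only if the whole array decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // throw instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;                                          // malformed or truncated input
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {1, 2}));       // plain ASCII bytes
        System.out.println(isValidUtf8(new byte[] {1, 2, -3}));   // 0xFD is never legal
        System.out.println(isValidUtf8(new byte[] {1, 2, -32}));  // 0xE0 with nothing after it
    }
}
```

The -32 case fails for a different reason than -3: 0xE0 announces a 3-byte sequence, but the array ends before any continuation byte arrives.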

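Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), represented in Java as -17, -65, -67. The same happens with -32 (hex 0xE0, binary 11100000): it is the start byte of a 3-byte sequence, but no continuation bytes follow it, so it is replaced too. That is why both arrays produce the same result.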
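The round trip from the question can be reproduced with a small sketch like the one below. It assumes the JDK's StandardCharsets.UTF_8 in place of the Charsets class used in the question (which presumably comes from a third-party library), and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    /** Decodes the bytes as UTF-8 and encodes the resulting String back to UTF-8. */
    static byte[] roundTrip(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        // The invalid byte has become the replacement character U+FFFD.
        System.out.println(decoded.charAt(2) == '\uFFFD');            // true

        // Encoding U+FFFD yields 0xEF 0xBF 0xBD, shown as signed bytes.
        System.out.println(Arrays.toString(roundTrip(bytes)));        // [1, 2, -17, -65, -67]

        // The -32 variant is also invalid, so it ends up the same way.
        System.out.println(Arrays.toString(roundTrip(new byte[] {1, 2, -32})));
    }
}
```

Because the replacement is lossy, the original byte cannot be recovered from the String. For arbitrary binary data, a byte array or an encoding such as Base64 is the safer container.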
