Why does a Chinese character take one char (2 bytes) but 3 bytes?

Question

I have the following program to test how Java handles Chinese characters:

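import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s3 = "世界您好";
char[] chs = s3.toCharArray();                                  // one char per character here
byte[] bs = s3.getBytes(StandardCharsets.UTF_8);                // UTF-8 bytes of the string
byte[] bs2 = new String(chs).getBytes(StandardCharsets.UTF_8);  // same bytes, rebuilt from the char[]

System.out.println("encoding=" + Charset.defaultCharset().name() + ", " + s3 + " char[].length=" + chs.length
                + ", byte[].length=" + bs.length + ", byte[]2.length=" + bs2.length);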

The printed output is:

encoding=UTF-8, 世界您好 char[].length=4, byte[].length=12, byte[]2.length=12

The results are:

  1. one Chinese character takes one char, which is 2 bytes in Java, if char[] is used to hold the Chinese characters;

  2. one Chinese character takes 3 bytes if byte[] is used to hold the Chinese characters.

My question is: if 2 bytes are enough, why do we use 3 bytes? And if 2 bytes are not enough, why do we use 2 bytes?

Edit:

My JVM's default encoding is set to UTF-8.

Recommended answer

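A Java char stores 16 bits of data in a two-byte object, using every bit to store the data. UTF-8 doesn't do this. For Chinese characters like these, UTF-8 uses three bytes: the first carries 4 bits of data and the other two carry 6 bits each, 16 bits in total. The remaining bits contain control information that marks where a character starts and how many bytes it spans. (It varies depending on the character. For ASCII characters, UTF-8 uses 7 bits of a single byte.) It's a complicated encoding mechanism, but it allows UTF-8 to represent every Unicode code point, up to U+10FFFF, in one to four bytes. This has the advantage of taking only one byte per character for 7-bit (ASCII) characters, making it backward compatible with ASCII. The cost is that it needs 3 bytes to store 16 bits of data. Note also that a char is a UTF-16 code unit, so 2 bytes are not always enough either: characters outside the Basic Multilingual Plane, including some rarer Chinese characters, take two chars. You can learn how it works by looking it up on Wikipedia.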
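To make the bit layout concrete, here is a minimal sketch that hand-encodes a single char into the 3-byte UTF-8 form and compares the result with what the JDK produces. The class name `Utf8Demo` and the helper `encodeThreeBytes` are made up for illustration, and the helper only handles chars in the range U+0800 to U+FFFF (surrogates excluded). The string is the one from the question, written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Hypothetical helper: encodes one char in U+0800..U+FFFF (not a surrogate)
    // as the three UTF-8 bytes 1110xxxx 10xxxxxx 10xxxxxx.
    public static byte[] encodeThreeBytes(char c) {
        if (c < 0x0800 || Character.isSurrogate(c)) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 character");
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),          // lead byte: marker 1110 + top 4 data bits
            (byte) (0x80 | ((c >> 6) & 0x3F)),  // continuation: marker 10 + middle 6 data bits
            (byte) (0x80 | (c & 0x3F))          // continuation: marker 10 + low 6 data bits
        };
    }

    public static void main(String[] args) {
        // The question's string, written with Unicode escapes.
        String s = "\u4E16\u754C\u60A8\u597D";

        byte[] manual = encodeThreeBytes(s.charAt(0));
        byte[] library = s.substring(0, 1).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library));                // true

        // Same four characters, two different encodings.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);     // 12
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 8
    }
}
```

The last two lines show that the byte count is a property of the chosen encoding rather than of the characters: the same string takes 2 bytes per character in UTF-16BE and 3 bytes per character in UTF-8.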
