Why Java char uses UTF-16?

Problem Description

Recently I have read a lot about Unicode code points and how they evolved over time, and of course I read http://www.joelonsoftware.com/articles/Unicode.html as well.

But one thing I couldn't find the real reason for is why Java uses UTF-16 for a char.

For example, suppose I have a string containing 1024 characters that are all within the ASCII range. That means 1024 * 2 bytes, i.e. 2 KB of memory that the string will consume no matter what.

So if Java's base char were UTF-8, that would be just 1 KB of data. Even if the string contained some characters that need more than one byte, for example ten occurrences of "字" (3 bytes each in UTF-8), the memory consumption would only grow accordingly: (1014 * 1 byte) + (10 * 3 bytes) = 1044 bytes, barely over 1 KB.

Granted, roughly 1 KB vs. 2 KB is not a dramatic difference, and ASCII is not really my point, but what I'm curious about is why it isn't UTF-8, which handles multibyte characters as well. UTF-16 looks like a waste of memory for any string that contains mostly single-byte characters.
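
To make the comparison concrete, here is a minimal sketch (the string contents and counts are just the illustrative numbers from above, and it measures the encoded byte lengths, not the JVM's internal String representation):

    import java.nio.charset.StandardCharsets;

    public class EncodingSizeDemo {
        public static void main(String[] args) {
            // 1014 ASCII letters plus 10 CJK characters ('\u5B57' is 字), echoing the example above
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1014; i++) {
                sb.append('a');
            }
            for (int i = 0; i < 10; i++) {
                sb.append('\u5B57');
            }
            String s = sb.toString();

            System.out.println("chars        : " + s.length());                                   // 1024
            System.out.println("UTF-16 bytes : " + s.getBytes(StandardCharsets.UTF_16BE).length); // 1024 * 2 = 2048
            System.out.println("UTF-8 bytes  : " + s.getBytes(StandardCharsets.UTF_8).length);    // 1014 + 10 * 3 = 1044
        }
    }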

Is there any good reason behind this?

Recommended Answer

Java used UCS-2 before transitioning over to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 was primarily historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated at that stage. That left UTF-16 as the easiest natural progression beyond UCS-2.
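
To see the consequence of that UCS-2 legacy, here is a minimal sketch (the example character U+1F600 is just an arbitrary supplementary character chosen for illustration) showing that anything outside the BMP occupies two char code units, which is why length() and charAt() index UTF-16 code units rather than characters:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 lies outside the BMP, so it needs a surrogate pair in UTF-16
            String s = "A" + new String(Character.toChars(0x1F600)) + "B";

            System.out.println("length()         : " + s.length());                               // 4 code units
            System.out.println("codePointCount() : " + s.codePointCount(0, s.length()));          // 3 code points
            System.out.println("charAt(1) is high surrogate: " + Character.isHighSurrogate(s.charAt(1))); // true

            // Iterating by code point rather than by char handles surrogate pairs correctly
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }

Working code point by code point (codePointAt, codePoints()) is how such strings are handled correctly today; that extra care is the price of the original fixed-width assumption described above.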
