Why does the Java ecosystem use different character encodings throughout its software stack?


Question

For instance, class files use modified UTF-8 (often loosely called CESU-8 or MUTF-8), whereas the runtime originally used UCS-2 internally and now uses UTF-16. The specification for Java source files says that a minimal conforming Java compiler only has to accept ASCII characters.

What are the reasons for these choices? Wouldn't it make more sense to use the same encoding throughout the Java ecosystem?

Recommended answer

Source files only require ASCII because, at the time, it wasn't considered reasonable to expect people to have text editors with full Unicode support. Things have improved since, but they still aren't perfect. The whole \uXXXX escape mechanism in Java is essentially Java's equivalent of C's trigraphs. (C has trigraphs because some keyboards and character sets lacked characters such as curly braces, so you had to spell them with a three-character sequence instead.)
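As a small illustration of the escape mechanism, the sketch below keeps the source file pure ASCII while still producing a non-ASCII character at run time. The class and method names are made up for this example.

```java
public class UnicodeEscapeDemo {
    // The escape below is translated very early in compilation, before the
    // source is split into tokens, so this literal really contains U+00E9.
    static String escapedLiteral() {
        return "caf\u00e9";
    }

    // The same string built without any escape, for comparison.
    static String builtLiteral() {
        return "caf" + (char) 0xE9;
    }

    public static void main(String[] args) {
        System.out.println(escapedLiteral().equals(builtLiteral())); // true

        // Because translation happens before tokenizing, an escape is also
        // allowed inside an identifier: this declares a variable named abc.
        int \u0061bc = 42;
        System.out.println(abc); // 42
    }
}
```

One consequence of the early translation is that an escape is processed even inside comments, so a malformed escape in a comment is a compile error.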

At the time Java was created, the class file format used UTF-8 and the runtime used UCS-2. Unicode had fewer than 64k code points, so 16 bits was enough. Later, when additional "planes" were added to Unicode, UCS-2 was replaced with the (pretty much) compatible UTF-16, and UTF-8 was replaced with a CESU-8-style encoding in which each UTF-16 surrogate is encoded separately (hence "Compatibility Encoding Scheme..."). Java's own documentation calls its variant "modified UTF-8"; it additionally encodes U+0000 as two bytes.

In the class file format they wanted to use UTF-8 to save space. The design of the class file format (including the JVM instruction set) was heavily geared towards compactness.
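The space argument, and the difference between standard UTF-8 and the class-file variant, can be checked from ordinary Java code. The sketch below is an illustration rather than anything the class file writer itself does: it relies on `DataOutputStream.writeUTF`, which is documented to emit modified UTF-8 after a 2-byte length prefix, and the class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Byte length in standard UTF-8.
    static int standardLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte length in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Byte length in modified UTF-8, as written by DataOutputStream.writeUTF.
    static int modifiedLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2; // drop the 2-byte length prefix
    }

    public static void main(String[] args) {
        // A typical ASCII-only class name: half the size of UTF-16.
        String name = "java/lang/Object";
        System.out.println(modifiedLength(name)); // 16
        System.out.println(utf16Length(name));    // 32

        // A supplementary character: two 3-byte surrogates instead of 4 bytes.
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(standardLength(smiley)); // 4
        System.out.println(modifiedLength(smiley)); // 6

        // U+0000 is written as two bytes so the output never contains a 0 byte.
        String nul = String.valueOf((char) 0);
        System.out.println(standardLength(nul)); // 1
        System.out.println(modifiedLength(nul)); // 2
    }
}
```

For the mostly-ASCII names and descriptors that dominate a constant pool, the variable-width encoding takes half the space of UTF-16, which fits the compactness goal described above.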

In the runtime they wanted to use UCS-2 because it was felt that saving space was less important than avoiding variable-width characters. Unfortunately, this has somewhat backfired now that the runtime uses UTF-16: a code point can take more than one "char", and worse, the "char" data type is now something of a misnomer (in general it no longer corresponds to a character, but to a UTF-16 code unit).
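The mismatch between "char" and "character" is easy to observe with any code point outside the Basic Multilingual Plane. A minimal sketch, with class and helper names invented for the example:

```java
public class CodeUnitDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(s));  // 2
        System.out.println(codePoints(s)); // 1

        // Each individual char is only half of the character.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // Iterating by code point recovers the original value.
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp))); // 1f600
    }
}
```

Code that indexes strings by `char` therefore has to be surrogate-aware, for instance by using `codePointAt` and `codePoints()`, if it may see supplementary characters.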
