How does Java store UTF-16 characters in its 16-bit char type?

Question

According to the Java SE 7 Specification, Java uses the Unicode UTF-16 standard to represent characters. When imagining a String as a simple array of 16-bit variables, each containing one character, life is simple.

Unfortunately, there are code points for which 16 bits simply aren't enough (I believe that is 16/17 of the Unicode code space). In a String this poses no direct problem: to store one of these ~1,048,576 code points, which need an additional two bytes, two array positions in that String are simply used.

This works for Strings without any direct problem, because there can always be an additional two bytes. But single variables, in contrast to the UTF-16 encoding, have a fixed length of 16 bits. How can these characters be stored there, and in particular, how does Java do it with its 2-byte "char" type?

Recommended answer

The answer is in the Javadoc:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits.

The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.) The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
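As an illustration of the surrogate-pair representation described in the quoted Javadoc, here is a minimal sketch. The class and method names (SurrogatePairDemo, toUtf16, manualPair) are made up for this example; Character.toChars, String.length and String.codePointCount are standard library calls. The manual computation follows the usual UTF-16 arithmetic and is only valid for supplementary code points.

```java
// Sketch: one supplementary code point occupies two char values in a String.
class SurrogatePairDemo {

    // Lets the standard library split a code point into UTF-16 code units
    // (one char for BMP code points, two chars for supplementary ones).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Hypothetical helper: computes the pair by hand.
    // Assumes codePoint is in the range 0x10000..0x10FFFF.
    static char[] manualPair(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x2F81A; // the CJK ideograph used in the Javadoc example
        char[] units = toUtf16(cp);
        // prints "U+2F81A -> D87E DC1A"
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) units[0], (int) units[1]);

        String s = new String(units);
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

So a single char variable can only ever hold one half of such a pair; the whole character needs either two chars or an int.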

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero.

Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows: The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter. The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph). In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
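The difference between the char-based and int-based methods can be checked directly. The following is a small sketch; the wrapper names (CharVersusIntDemo, isLetterAsChar, isLetterAsCodePoint, combine) are invented for illustration, while Character.isLetter(char), Character.isLetter(int) and Character.toCodePoint are the real API calls.

```java
// Sketch: char overloads see a lone surrogate, int overloads see the code point.
class CharVersusIntDemo {

    // Uses the overload that only accepts a single UTF-16 code unit.
    static boolean isLetterAsChar(char c) {
        return Character.isLetter(c);
    }

    // Uses the overload that accepts a full Unicode code point.
    static boolean isLetterAsCodePoint(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Combines a high and a low surrogate into the code point they encode.
    static int combine(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char high = '\uD840';
        char low = '\uDC00';

        System.out.println(isLetterAsChar(high));         // false: lone surrogate
        int cp = combine(high, low);                      // 0x20000
        System.out.println(Integer.toHexString(cp));      // 20000
        System.out.println(isLetterAsCodePoint(cp));      // true: CJK ideograph
        System.out.println(isLetterAsCodePoint(0x2F81A)); // true, as in the Javadoc
    }
}
```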

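To put it simply:

  • The rule that a char is 16 bits was designed for an old version of the Unicode standard.

  • You sometimes need two chars to represent a Unicode rune (code point) that is not in the Basic Multilingual Plane. This kind of "works" because you don't use chars that often, especially not to handle Unicode runes outside the BMP.

Said even more simply:

  • A Java char doesn't represent a Unicode code point (well, not always).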
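In practice this means that code which must handle text outside the BMP should walk a String by code point rather than by char. Below is a minimal sketch of one way to do that; the class and method names are hypothetical, and String.codePointAt and Character.charCount are standard library methods.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: iterate over a String one code point at a time.
class CodePointIterationDemo {

    // Collects the code points of s, advancing by one or two chars per step.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads a full pair if one starts here
            result.add(cp);
            i += Character.charCount(cp);  // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", one supplementary character, then "b"
        String s = "a" + new String(Character.toChars(0x2F81A)) + "b";
        System.out.println(s.length());              // 4 chars
        System.out.println(codePointsOf(s).size());  // 3 code points
    }
}
```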

As an aside, it can be noted that the evolution of Unicode to extend past the BMP made UTF-16 largely irrelevant, now that UTF-16 no longer even guarantees a fixed ratio of bytes to characters. That's why more modern languages are based on UTF-8. This manifesto helps understand it.
