How does Java store UTF-16 characters in its 16-bit char type?

Question

According to the Java SE 7 Specification, Java uses the Unicode UTF-16 standard to represent characters. When you imagine a String as a simple array of 16-bit variables, each containing one character, life is simple.

Unfortunately, there are code points for which 16 bits simply aren't enough (I believe that is 16/17ths of all Unicode characters). In a String this poses no direct problem: to store one of these ~1,048,576 characters, which needs an additional two bytes, two array positions in that String are simply used.

This works for Strings without any direct problem, because there can always be an additional two bytes. But when it comes to single variables, which, in contrast to the UTF-16 encoding, have a fixed length of 16 bits, how can these characters be stored? In particular, how does Java do it with its 2-byte "char" type?

Recommended answer

The answer is in the javadoc:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits.

The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.) The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
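The surrogate-pair representation described above can be sketched in a few lines. The class and method names below are hypothetical and exist only for illustration; the arithmetic follows the UTF-16 rules (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and the result is compared against the standard `Character.toChars` method. U+2F81A, a CJK ideograph, is used as the sample code point.

```java
import java.util.Arrays;

// Illustrative sketch; the class and method names are hypothetical.
class SurrogatePairDemo {

    // Splits a supplementary code point (U+10000..U+10FFFF) into a
    // high surrogate and a low surrogate, following the UTF-16 rules.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int codePoint = 0x2F81A;
        char[] manual = toSurrogates(codePoint);
        char[] library = Character.toChars(codePoint);

        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]); // D87E DC1A
        System.out.println(Arrays.equals(manual, library));                 // true

        String s = new String(library);
        System.out.println(s.length());                    // 2 (two char values)
        System.out.println(s.codePointAt(0) == codePoint); // true
    }
}
```

In ordinary code there is no reason to do the bit arithmetic by hand; `Character.toChars` and `String.codePointAt` already handle it. The manual version is shown only to make the mapping visible.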

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero.
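To make the char/int distinction concrete, here is a small sketch (hypothetical class and method names) showing that a single `char` cannot hold a supplementary code point while an `int` can, and how many `char` values each code point needs.

```java
// Illustrative sketch; the class and method names are hypothetical.
class CodePointWidthDemo {

    // Returns how many char values (UTF-16 code units) a code point needs.
    static int charsNeeded(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside U+0000..U+10FFFF");
        }
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(charsNeeded(0x41));    // 1 (BMP)
        System.out.println(charsNeeded(0x20AC));  // 1 (BMP)
        System.out.println(charsNeeded(0x2F81A)); // 2 (supplementary)

        // Casting to char keeps only the low 16 bits, so the value is lost.
        char truncated = (char) 0x2F81A;
        System.out.printf("%04X%n", (int) truncated); // F81A

        // Values above U+10FFFF are not code points at all.
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```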

Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows: The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter. The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph). In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
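The two `Character.isLetter` calls from the javadoc can be tried directly. The sketch below (hypothetical class and helper names) also shows the practical consequence: classify a string by its code points, not by its individual `char` values.

```java
// Illustrative sketch; the class and method names are hypothetical.
class IsLetterDemo {

    // Classifies the first code point of a string rather than its first char.
    static boolean firstCodePointIsLetter(String s) {
        return Character.isLetter(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // char overload: a lone high surrogate is not a letter.
        System.out.println(Character.isLetter('\uD840'));  // false

        // int overload: the full code point is a letter.
        System.out.println(Character.isLetter(0x2F81A));   // true

        // The surrogate pair D840 DC00 encodes U+20000, a CJK ideograph.
        String pair = "\uD840\uDC00";
        System.out.println(Character.isLetter(pair.charAt(0))); // false
        System.out.println(firstCodePointIsLetter(pair));       // true
    }
}
```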

To put it simply:

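  • the 16-bit rule for a char was designed for an old version of the Unicode standard
  • you sometimes need two chars to represent a Unicode rune (code point) that is not in the Basic Multilingual Plane. This kind of "works" because you don't use chars very often, especially not to handle Unicode runes outside the BMP.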

Put even more simply:

  • a Java char doesn't represent a Unicode code point (well, not always).
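A short sketch of what this means in practice, using hypothetical class and method names: `String.length()` counts `char` values, not code points, so code that must be correct outside the BMP should step through a string by code point.

```java
// Illustrative sketch; the class and method names are hypothetical.
class CodePointIterationDemo {

    // Counts Unicode code points instead of UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", then U+2F81A (stored as two chars), then "b".
        String s = "a" + new String(Character.toChars(0x2F81A)) + "b";

        System.out.println(s.length());         // 4 char values
        System.out.println(countCodePoints(s)); // 3 code points

        // Advance by one or two chars depending on the code point.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

On Java 8 and later, `s.codePoints()` returns the same sequence as an `IntStream`, which is often more convenient than the manual loop.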

As an aside, the evolution of Unicode past the BMP took away much of the appeal of UTF-16, since it no longer guarantees a fixed ratio of bytes to characters. That's why more modern languages are based on UTF-8. The manifesto linked from the original answer helps explain this.
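The point about the byte-to-character ratio can be checked by encoding a few one-code-point strings. This is only a sketch with hypothetical class and method names; `UTF_16BE` is used so that no byte order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are hypothetical.
class EncodedSizeDemo {

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                         // U+0041
        String euro = "\u20AC";                                     // U+20AC
        String ideograph = new String(Character.toChars(0x2F81A));  // U+2F81A

        // Each string holds exactly one code point.
        System.out.println(utf8Bytes(ascii) + " / " + utf16Bytes(ascii));         // 1 / 2
        System.out.println(utf8Bytes(euro) + " / " + utf16Bytes(euro));           // 3 / 2
        System.out.println(utf8Bytes(ideograph) + " / " + utf16Bytes(ideograph)); // 4 / 4
    }
}
```

Both encodings are variable-width: UTF-16 uses 2 or 4 bytes per code point, UTF-8 uses 1 to 4.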
