Why is Unicode restricted to 0x10FFFF?


Question

Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent a code point above this, for example 0x10FFFF + 0x000001 = 0x110000, through an encoding scheme such as UTF-16 or UTF-8?

Recommended answer

It's because of UTF-16. Characters outside the Basic Multilingual Plane (BMP) are represented in UTF-16 by a surrogate pair: the first code unit (CU) lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each CU carries 10 bits of the code point, for a total of 20 bits of data (0x100000 characters), which is split into 16 planes (16 × 2^16 characters). The BMP itself accounts for the remaining 0x10000 characters (code points 0–0xFFFF).
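The split into two 10-bit halves can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example and are not part of any library.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = cp - 0x10000            # 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)   # upper 10 bits go into the first code unit
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits go into the second code unit
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is encoded as the pair D83D DE00.
print([hex(unit) for unit in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']

# The largest possible pair, DBFF DFFF, decodes to the last code point.
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```

Because both halves are capped at 10 bits, there is no pair left over that could stand for 0x110000 or anything above it.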

Therefore the total number of code points is 17 × 2^16 = 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, and the 20 bits of a surrogate pair hold values up to 0xFFFFF, which means the last code point that a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
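The arithmetic can be checked directly. The sketch below restates the two calculations as assertions and then compares the result with CPython's own limit (`sys.maxunicode`); the helper `is_representable` is a hypothetical name used only here.

```python
import sys

PLANE_SIZE = 2 ** 16          # code points per plane
supplementary = 2 ** 20       # 20 bits carried by a surrogate pair

# 16 supplementary planes plus the BMP give 17 planes in total.
assert supplementary == 16 * PLANE_SIZE == 0x100000
total_code_points = supplementary + PLANE_SIZE
assert total_code_points == 17 * PLANE_SIZE == 0x110000

# Both derivations of the last code point agree.
max_code_point = total_code_points - 1
assert max_code_point == 0xFFFFF + 0x10000 == 0x10FFFF

# Python 3 uses the same upper bound for its str type.
assert sys.maxunicode == max_code_point


def is_representable(cp: int) -> bool:
    """Return True if chr() accepts cp, i.e. 0 <= cp <= 0x10FFFF."""
    try:
        chr(cp)
    except (ValueError, OverflowError):
        return False
    return True


print(is_representable(0x10FFFF))  # True
print(is_representable(0x110000))  # False
```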

The Unicode Character Encoding Stability Policies guarantee that this will not change, so a code point above 0x10FFFF will never be assigned:

"The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change."

Historically, UTF-8 allowed code points up to U+7FFFFFFF using sequences of up to 6 bytes, and a 32-bit UTF-32 code unit can hold twice as many values as that. However, because of the limit in UTF-16, the Unicode committee decided that a UTF-8 sequence can never be longer than 4 bytes, which gives UTF-8 the same range as UTF-16:

"In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences."

https://en.wikipedia.org/wiki/UTF-8#History
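The effect of the RFC 3629 restriction can be observed with Python's built-in strict UTF-8 codec. This is a small sketch rather than a full validator; `can_decode_utf8` is a hypothetical helper, and the byte strings were chosen by hand to represent the cases named in the quotation.

```python
def can_decode_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+10FFFF is the last code point and needs exactly four bytes.
print(chr(0x10FFFF).encode("utf-8").hex())       # f48fbfbf

# F4 90 80 80 would be U+110000 under the original scheme; it is rejected.
print(can_decode_utf8(b"\xf4\x90\x80\x80"))      # False

# ED A0 80 would be the surrogate U+D800; it is rejected as well.
print(can_decode_utf8(b"\xed\xa0\x80"))          # False

# F8 is the lead byte of a five-byte sequence, which no longer exists.
print(can_decode_utf8(b"\xf8\x88\x80\x80\x80"))  # False
```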

The same restriction applies to UTF-32:

"In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32."

https://en.wikipedia.org/wiki/UTF-32
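A 32-bit code unit has room for far more values than Unicode uses, so a UTF-32 decoder has to check the range explicitly. The sketch below writes that rule out as a predicate and compares it with Python's `utf-32-le` codec. The function names are hypothetical, and the surrogate check on the codec side assumes Python 3.4 or later, where the UTF-32 decoders reject surrogate code points.

```python
import struct


def is_valid_utf32_unit(value: int) -> bool:
    """Range rule from the quotation: at most U+10FFFF and not a surrogate."""
    return 0 <= value <= 0x10FFFF and not 0xD800 <= value <= 0xDFFF


def codec_accepts(value: int) -> bool:
    """Ask Python's strict UTF-32 (little-endian) decoder about one 32-bit unit."""
    try:
        struct.pack("<I", value).decode("utf-32-le")
    except UnicodeDecodeError:
        return False
    return True


for value in (0x41, 0xD800, 0x10FFFF, 0x110000, 0xFFFFFFFF):
    # The hand-written rule and the built-in codec should agree on every sample.
    print(hex(value), is_valid_utf32_unit(value), codec_accepts(value))
```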

You can also read this more detailed answer.
