Why is Unicode restricted to 0x10FFFF?

Question

Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent code points above this limit, e.g. 0x10FFFF + 0x000001 = 0x110000, through any encoding scheme such as UTF-16 or UTF-8?

Recommended answer

It's because of UTF-16. Code points outside the BMP (Basic Multilingual Plane) are represented in UTF-16 by a surrogate pair, in which the first code unit lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each code unit carries 10 bits of the code point, giving 20 bits of data in total (0x100000 code points), which is split into 16 planes (16 × 2^16 code points). The BMP itself accounts for the remaining 0x10000 code points (0–0xFFFF).
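The surrogate-pair mapping described above can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this illustration; the arithmetic follows the ranges given in the answer.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)     # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest pair (0xDBFF, 0xDFFF) maps to the largest code point.
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))
```

Because both halves are already at their maximum for U+10FFFF, there is no surrogate pair left that could stand for 0x110000.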

Therefore the total number of code points is 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for code points encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point that a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
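As a quick check of the arithmetic, the same numbers can be recomputed from the surrogate ranges alone. The variable names are illustrative. The comparison with `sys.maxunicode` assumes Python 3.3 or later, where that value is always 0x10FFFF.

```python
import sys

high_surrogates = 0xDBFF - 0xD800 + 1   # number of possible first code units
low_surrogates = 0xDFFF - 0xDC00 + 1    # number of possible second code units

supplementary = high_surrogates * low_surrogates   # code points reachable via pairs
bmp = 0xFFFF + 1                                   # code points 0..0xFFFF
total = supplementary + bmp
max_code_point = total - 1

print(hex(supplementary), hex(total), hex(max_code_point))
print(max_code_point == sys.maxunicode)
```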

The Unicode Character Encoding Stability Policy guarantees this:

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.

Historically, UTF-8 allowed code points up to U+7FFFFFFF using 6 bytes, and UTF-32 could store twice as many values as that. However, because of the limit in UTF-16, the Unicode committee decided that UTF-8 can never be longer than 4 bytes, which gives it the same range as UTF-16.

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

https://en.wikipedia.org/wiki/UTF-8#History
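The percentages in the quoted passage can be reproduced with a short calculation. This is only a sketch: it counts non-overlong sequences, assuming three-byte forms cover 0x800–0xFFFF and four-byte forms cover 0x10000–0x1FFFFF. The last lines assume CPython's strict `utf-8` decoder, which follows the RFC 3629 limit.

```python
# Three-byte UTF-8 sequences cover 0x800..0xFFFF; the surrogates sit inside that range.
three_byte_total = 0xFFFF - 0x800 + 1
surrogate_count = 0xDFFF - 0xD800 + 1
surrogate_share = surrogate_count / three_byte_total

# Four-byte sequences have 21 payload bits, so they could reach 0x1FFFFF.
four_byte_total = 0x1FFFFF - 0x10000 + 1
above_limit = 0x1FFFFF - 0x110000 + 1
above_limit_share = above_limit / four_byte_total

print(f"{surrogate_share:.2%} of three-byte sequences removed")
print(f"{above_limit_share:.2%} of four-byte sequences removed")

# F4 8F BF BF is U+10FFFF; F4 90 80 80 would be 0x110000 and is rejected.
print(hex(ord(bytes([0xF4, 0x8F, 0xBF, 0xBF]).decode("utf-8"))))
try:
    bytes([0xF4, 0x90, 0x80, 0x80]).decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print("0x110000 rejected:", rejected)
```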

The same restriction has been applied to UTF-32:

In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32

https://en.wikipedia.org/wiki/UTF-32
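A 32-bit code unit has room for far larger values, so the UTF-32 limit is a rule and not a storage constraint. A minimal sketch of that rule follows. The helper names `is_unicode_scalar_value` and `decodes_as_utf32` are invented for this example, and the decoder behaviour shown is that of CPython's `utf-32-le` codec.

```python
def is_unicode_scalar_value(cp):
    """True if cp is at most 0x10FFFF and is not a surrogate (0xD800..0xDFFF)."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def decodes_as_utf32(value):
    """Try to decode one 32-bit little-endian code unit as UTF-32."""
    try:
        value.to_bytes(4, "little").decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


for value in (0x10FFFF, 0x110000):
    print(hex(value), is_unicode_scalar_value(value), decodes_as_utf32(value))

# chr() enforces the same upper bound.
try:
    chr(0x110000)
except ValueError as exc:
    print("chr(0x110000) failed:", exc)
```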

For more detail, see these answers:

  • Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?
  • Does the Unicode Consortium Intend to make UTF-16 run out of characters?
  • How many characters can be mapped with Unicode?
  • Proposal to restrict the range of code positions to the values up to U-0010FFFF
