Does Unicode have a defined maximum number of code points?


Question

I have read many articles trying to find out the maximum number of Unicode code points, but I have not found a definitive answer.

I understand that the Unicode code point range was restricted so that the UTF-8, UTF-16, and UTF-32 encodings can all handle the same set of code points. But how many code points is that?

The most frequent answer I have encountered is that Unicode code points range from 0x000000 to 0x10FFFF (1,114,112 code points), but I have also read elsewhere that there are 1,112,114 code points. So is there a single number to give, or is the issue more complicated than that?

Recommended answer

The maximum valid code point in Unicode is U+10FFFF, which makes it a 21-bit code set. Not all 21-bit integers are valid Unicode code points, though: the values from 0x110000 to 0x1FFFFF are not.

This is where the number 1,114,112 comes from: the range U+0000 .. U+10FFFF contains 1,114,112 values.
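The range arithmetic above can be checked directly. The following is a minimal Python sketch; the names MAX_CODE_POINT and is_code_point are illustrative and do not come from the Unicode Standard.

```python
import sys

# Upper bound of the Unicode code space, as stated above.
MAX_CODE_POINT = 0x10FFFF

# Code points run from U+0000 through U+10FFFF inclusive.
total_code_points = MAX_CODE_POINT + 1
print(f"{total_code_points:,}")  # 1,114,112

# The bound needs 21 bits, but most 21-bit values above it are unused.
print(MAX_CODE_POINT.bit_length())  # 21


def is_code_point(value: int) -> bool:
    """Return True if value lies inside the Unicode code space."""
    return 0 <= value <= MAX_CODE_POINT


print(is_code_point(0x10FFFF))  # True
print(is_code_point(0x110000))  # False

# CPython (3.3 and later) uses the same upper bound for str.
print(sys.maxunicode == MAX_CODE_POINT)  # True
```

In CPython, chr(0x110000) raises ValueError for the same reason.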

However, there is also a set of code points that serve as the surrogates for UTF-16. These occupy the range U+D800 .. U+DFFF, which is 2,048 code points reserved for UTF-16.

1,114,112 - 2,048 = 1,112,064
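The subtraction can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
# Surrogate code points reserved for UTF-16: U+D800 .. U+DFFF inclusive.
SURROGATES = range(0xD800, 0xDFFF + 1)

total_code_points = 0x10FFFF + 1
non_surrogate_code_points = total_code_points - len(SURROGATES)

print(len(SURROGATES))                    # 2048
print(f"{non_surrogate_code_points:,}")   # 1,112,064

# A lone surrogate is a code point but cannot be encoded as UTF-8 text.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False

print(lone_surrogate_encodable)  # False
```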

There are also 66 noncharacters. These are defined in part in Corrigendum #9: 34 values of the form U+nFFFE and U+nFFFF (where n is one of 0x00000, 0x10000, … 0xF0000, 0x100000), and 32 values in the range U+FDD0 .. U+FDEF. Subtracting those as well yields 1,111,998 allocatable characters.

Three ranges are reserved for private use: U+E000 .. U+F8FF, U+F0000 .. U+FFFFD, and U+100000 .. U+10FFFD.

The number of values actually assigned depends on the version of Unicode you are looking at. You can find information about the latest version at the Unicode Consortium. Among other things, the introduction there says:

The Unicode Standard, Version 7.0, contains 112,956 characters

So only about 10% of the available code points have been allocated.
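The noncharacter count, the allocatable total, and the 10% figure can all be rechecked with a short Python sketch. The helper name noncharacters is hypothetical, and the figure 112,956 is simply the Version 7.0 count quoted above.

```python
def noncharacters() -> list:
    """List the 66 noncharacter code points described above."""
    # 32 values in U+FDD0 .. U+FDEF.
    points = list(range(0xFDD0, 0xFDEF + 1))
    # The last two code points of each of the 17 planes (U+nFFFE, U+nFFFF).
    for plane in range(17):
        base = plane * 0x10000
        points += [base + 0xFFFE, base + 0xFFFF]
    return points


nonchars = noncharacters()
surrogate_count = 0xDFFF - 0xD800 + 1
allocatable = (0x10FFFF + 1) - surrogate_count - len(nonchars)

# The three private-use ranges; they remain part of the allocatable total.
private_use = (
    (0xF8FF - 0xE000 + 1)
    + (0xFFFFD - 0xF0000 + 1)
    + (0x10FFFD - 0x100000 + 1)
)

ASSIGNED_IN_7_0 = 112_956  # figure quoted above for Unicode 7.0
ratio = ASSIGNED_IN_7_0 / allocatable

print(len(nonchars))         # 66
print(f"{allocatable:,}")    # 1,111,998
print(f"{private_use:,}")    # 137,468
print(f"{ratio:.1%}")        # 10.2%
```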

I can't account for where the figure 1,112,114 came from.

Incidentally, the upper limit U+10FFFF was chosen so that every Unicode value can be represented in UTF-16 with one or two 16-bit (2-byte) code units, using one high surrogate and one low surrogate to represent values outside the BMP (Basic Multilingual Plane), which is the range U+0000 .. U+FFFF.
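To see why the limit lands exactly on U+10FFFF, it may help to sketch the surrogate-pair arithmetic in Python. The function name to_utf16_units is invented for this illustration; in real code, str.encode("utf-16-be") does the same work.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not encodable in UTF-16: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # BMP: a single 16-bit unit
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


# 1,024 high surrogates x 1,024 low surrogates cover the 16 planes above the BMP.
largest = 0x10000 + 0x400 * 0x400 - 1
print(hex(largest))                                   # 0x10ffff

print([hex(u) for u in to_utf16_units(0x10FFFF)])     # ['0xdbff', '0xdfff']

# Cross-check one value against Python's built-in encoder.
units = to_utf16_units(0x1F600)
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == b"".join(u.to_bytes(2, "big") for u in units))  # True
```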
