How many characters can be mapped with Unicode?
Question
I am asking for the count of all possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3, or 4 bytes. I also don't understand why continuation bytes are restricted, even though the starting byte of a character already indicates how long its encoding should be.
Recommended answer
I am asking for the count of all possible valid combinations in Unicode, with an explanation.
1,111,998: 17 planes × 65,536 code points per plane - 2,048 surrogates - 66 noncharacters
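The arithmetic above can be checked with a short Python sketch. The ranges used here (surrogates at U+D800..U+DFFF, noncharacters at U+FDD0..U+FDEF plus the last two code points of each plane) are the standard Unicode definitions; the variable names are only illustrative.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

# Surrogates U+D800..U+DFFF are reserved for UTF-16 pairs.
surrogates = range(0xD800, 0xE000)

# Noncharacters: U+FDD0..U+FDEF plus U+xxFFFE and U+xxFFFF in every plane.
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(PLANES):
    noncharacters.add(plane * 0x10000 + 0xFFFE)
    noncharacters.add(plane * 0x10000 + 0xFFFF)

total = PLANES * CODE_POINTS_PER_PLANE
usable = total - len(surrogates) - len(noncharacters)

print(total, len(surrogates), len(noncharacters), usable)
# prints: 1114112 2048 66 1111998
```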
Note that UTF-8 and UTF-32 could theoretically encode far more than 17 planes, but the range is restricted to what the UTF-16 encoding can represent.
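To see where the 17-plane limit comes from, here is a rough sketch of how a UTF-16 surrogate pair maps to a code point. The helper name `decode_pair` is made up for this illustration; the formula is the standard UTF-16 one.

```python
HIGH = range(0xD800, 0xDC00)  # 1,024 high (lead) surrogates
LOW = range(0xDC00, 0xE000)   # 1,024 low (trail) surrogates

def decode_pair(high, low):
    """Combine a high and a low surrogate into one supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest pair gives the largest code point UTF-16 can express.
max_code_point = decode_pair(HIGH[-1], LOW[-1])

# 1,024 x 1,024 pairs cover 16 supplementary planes; the BMP is the 17th.
supplementary_planes = (len(HIGH) * len(LOW)) // 0x10000

print(hex(max_code_point), supplementary_planes + 1)
# prints: 0x10ffff 17
```

Python itself enforces the same ceiling: `chr(0x10FFFF)` works, while `chr(0x110000)` raises `ValueError`.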
137,929 code points are actually assigned to characters as of Unicode 12.1.
I also don't understand why continuation bytes are restricted, even though the starting byte of a character already indicates how long its encoding should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encodings of the digits 0 (byte 30) and 8 (byte 38). So if you have a string-searching function that was not designed for this encoding-specific quirk, a search for the digit 8 will find a false positive within the letter ß.
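The false positive is easy to reproduce in Python, which ships a `gb18030` codec. This is only a small demonstration of a naive byte-level search, not a recommendation for how to search encoded text.

```python
encoded = "ß".encode("gb18030")

# Show the four bytes as hex: 81 30 89 38
print(" ".join(f"{b:02x}" for b in encoded))

# A naive byte search for the ASCII digit "8" (byte 0x38) hits the last
# byte of ß, even though the text contains no digit at all.
needle = "8".encode("gb18030")
false_hit = encoded.find(needle)
print(false_hit)
# prints: 3
```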
In UTF-8, this cannot happen, because lead bytes and trail bytes occupy non-overlapping ranges, which guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
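A minimal sketch of what self-synchronization buys you: because trail bytes always have the bit pattern 10xxxxxx and lead bytes never do, a decoder dropped at an arbitrary offset can find the start of the current character by skipping backwards over trail bytes. The function names below are hypothetical helpers, and the sketch assumes the input is valid UTF-8.

```python
def is_trail_byte(byte):
    """Trail (continuation) bytes are 0x80..0xBF, i.e. 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def char_start(data, index):
    """Walk back from any offset to the first byte of its character."""
    while index > 0 and is_trail_byte(data[index]):
        index -= 1
    return index

# "a" is 1 byte, "ß" 2 bytes, "€" 3 bytes, "😀" 4 bytes in UTF-8.
data = "aß€😀".encode("utf-8")

starts = sorted({char_start(data, i) for i in range(len(data))})
print(starts)
# prints: [0, 1, 3, 6]

# No ASCII byte can appear inside a multi-byte character, so the
# GB 18030 false positive has no counterpart here.
print(b"8" in "ß".encode("utf-8"))
# prints: False
```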