How many characters can be mapped with Unicode?


Question

I am asking for the count of all the possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3, or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.

Recommended answer

"I am asking for the count of all the possible valid combinations in Unicode with explanation."

1,111,998: 17 planes × 65,536 code points per plane - 2,048 surrogates - 66 noncharacters
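The arithmetic above can be checked with a short sketch. The variable names are illustrative only; the ranges used are the surrogate block U+D800..U+DFFF and the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes).

```python
# Sketch: reproduce the 1,111,998 figure from the code point ranges.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

# Surrogates occupy U+D800..U+DFFF and can never be characters.
SURROGATES = 0xDFFF - 0xD800 + 1  # 2,048

# Noncharacters: U+FDD0..U+FDEF (32) plus U+xxFFFE and U+xxFFFF
# in each of the 17 planes (2 * 17 = 34).
NONCHARACTERS = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES  # 66

total_code_points = PLANES * CODE_POINTS_PER_PLANE
usable = total_code_points - SURROGATES - NONCHARACTERS

print(f"{total_code_points:,} code points in total")
print(f"{usable:,} available for characters")
```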

Note that UTF-8 and UTF-32 could theoretically encode many more than 17 planes, but the code space is restricted to what UTF-16 can represent.
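To see where the 17-plane limit comes from, here is a rough sketch of the UTF-16 arithmetic: a surrogate pair combines one of 1,024 high surrogates with one of 1,024 low surrogates, which covers exactly 16 supplementary planes on top of the Basic Multilingual Plane. For comparison, the sketch also counts the payload bits of a 4-byte UTF-8 sequence. The variable names are made up for illustration.

```python
# Sketch: why UTF-16 caps the code space at 17 planes (U+0000..U+10FFFF).
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1  # 1,024 lead surrogates
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1   # 1,024 trail surrogates

bmp = 0x10000                                      # plane 0
supplementary = HIGH_SURROGATES * LOW_SURROGATES   # 16 more planes
utf16_code_space = bmp + supplementary
max_code_point = utf16_code_space - 1

# A 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
# carries 3 + 6 + 6 + 6 = 21 payload bits, more than 17 planes need.
utf8_4byte_capacity = 2 ** (3 + 6 + 6 + 6)

print(hex(max_code_point))
print(utf16_code_space // 0x10000, "planes reachable from UTF-16")
print(utf8_4byte_capacity // 0x10000, "planes' worth of 4-byte UTF-8 patterns")
```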

Of these, 137,929 code points are actually assigned to characters as of Unicode 12.1.

"I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be."

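The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing. Every continuation byte has the form 10xxxxxx and no lead byte does, so a decoder that starts at an arbitrary byte offset can always tell whether it is in the middle of a character and where the next one begins.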
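As a minimal sketch of what self-synchronization buys you, the hypothetical helper below (`char_start` is not a library function) backs up from an arbitrary byte offset to the start of the character containing it, using only the fact that continuation bytes match the bit pattern 10xxxxxx. It assumes the input is well-formed UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the offset of the lead byte of the character covering `index`.

    Assumes `data` is well-formed UTF-8. Continuation bytes are exactly
    those whose top two bits are 10, so we step back until we leave them.
    """
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


# 'a' is 1 byte, 'ß' 2 bytes, '€' 3 bytes, '😀' 4 bytes in UTF-8.
encoded = "aß€😀".encode("utf-8")

for offset in range(len(encoded)):
    start = char_start(encoded, offset)
    print(offset, "->", start)
```

Without the restriction on continuation bytes, a byte in the middle of a character could look like a lead byte, and no such local resynchronization would be possible.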

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the bytes 30 and 38, the encodings of the digits 0 and 8. So if your string-searching function is not designed for this encoding-specific quirk, a search for the digit 8 will find a false positive inside the letter ß.

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
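The contrast between the two encodings can be reproduced with a naive byte-level search. This is only a sketch, using the `gb18030` and `utf-8` codecs that ship with CPython; the variable names are illustrative.

```python
# Sketch: a byte-level search for the digit "8" inside the single letter "ß".
letter = "ß"
needle = "8".encode("ascii")  # b"8", i.e. the byte 0x38

gb = letter.encode("gb18030")
utf8 = letter.encode("utf-8")

print(gb.hex(" "))    # byte sequence of ß in GB 18030
print(utf8.hex(" "))  # byte sequence of ß in UTF-8

# GB 18030: the fourth byte of ß is 0x38, so the search matches falsely.
print("GB 18030 false positive:", needle in gb)

# UTF-8: every byte of a multi-byte character is >= 0x80,
# so an ASCII digit can never appear inside it.
print("UTF-8 false positive:", needle in utf8)
```

A search that operates on decoded strings rather than raw bytes avoids the problem in either encoding; the point here is only that UTF-8 makes the byte-level search safe as well.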
