What are the upper and lower bounds for Chinese characters in UTF-8?

Views: 90 | Published: 2020/10/1 21:06:39 | Tags: python, cjk


Question

I would like to build a set in Python that contains the ord() values of all Chinese characters:

The equivalent for English would be:

# In Python 3, range objects cannot be joined with +, so union two sets instead.
english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))

Recommended answer

From the Unicode Standard (v6.0, section 12.1):

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

There are also a few small extensions to the URO. These code points fall within the 4E00–9FFF block listed above, so they do not extend the overall range:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard
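
As a quick sanity check on where these small extensions sit, the following sketch compares each range in Table 12-3 against the CJK Unified Ideographs block from Table 12-2. The names `URO_BLOCK` and `SMALL_EXTENSIONS` are made up for this illustration and are not part of the original answer.

```python
# Block range from Table 12-2 (inclusive bounds).
URO_BLOCK = (0x4E00, 0x9FFF)

# Ranges from Table 12-3 (inclusive bounds).
SMALL_EXTENSIONS = [
    (0x9FA6, 0x9FB3),
    (0x9FB4, 0x9FBB),
    (0x9FBC, 0x9FC2),
    (0x9FC3, 0x9FC3),
    (0x9FC4, 0x9FC6),
    (0x9FC7, 0x9FCB),
]

# True only if every extension range lies inside the block.
all_inside = all(
    URO_BLOCK[0] <= lo and hi <= URO_BLOCK[1]
    for lo, hi in SMALL_EXTENSIONS
)
print(all_inside)  # True
```

Every range falls within 4E00–9FFF, so a set built from the block ranges in Table 12-2 already covers these code points.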

To build a set of the ordinal values of these characters, you can do this:

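# Python 3: range objects cannot be joined with +, so chain them instead.
from itertools import chain

chinese = set(chain(
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
    range(0x9FA6, 0x9FCC),    # Table 12-3 (already inside 4E00-9FFF)
))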

Be aware, though, that this set contains over 75,000 code points (75,744 for the ranges above, some of them unassigned), so it may not be the most compact or efficient data structure for this.
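
If only membership tests are needed, one lighter-weight alternative is to keep the ranges themselves and compare code points against them, rather than materializing every value. This is a minimal sketch and not part of the original answer; `HAN_RANGES` and `is_han` are hypothetical names, and the ranges are the ones from Table 12-2.

```python
# Inclusive (first, last) code point ranges from Table 12-2.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch falls in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)


print(is_han("中"))  # True
print(is_han("a"))   # False
```

This stores seven pairs instead of tens of thousands of integers, at the cost of a short linear scan per lookup. Whether that trade-off is worthwhile depends on how often the check runs.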

Also, if you insist on using ord() on literal characters outside the Basic Multilingual Plane, you will need the 32-bit escape form, which takes exactly eight hex digits after \U:

>>> ord(u'\U0002F800')
194560
