What are the upper and lower bounds for Chinese characters in UTF-8?


Question

I would like to build a set in Python that contains the ord() values of all Chinese characters.

For English, the equivalent would be:

# range objects cannot be joined with + on Python 3, so union two sets instead
english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))


Recommended answer

From the Unicode Standard (v6.0, section 12.1):

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.



Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

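The standard also lists a few small extensions to the URO, added in later versions. Note that these code points all fall inside the 4E00–9FFF range of the CJK Unified Ideographs block listed above: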

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To build a set of the ordinal values of these characters, you can do this:

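from itertools import chain

# chain() is used because range objects cannot be joined with + on Python 3
chinese = set(chain(range(0x4E00, 0xA000),
                    range(0x3400, 0x4DC0),
                    range(0x20000, 0x2A6E0),
                    range(0x2A700, 0x2B740),
                    range(0x2B740, 0x2B820),
                    range(0xF900, 0xFB00),
                    range(0x2F800, 0x2FA20),
                    range(0x9FA6, 0x9FCC)))  # already inside 4E00-9FFF; harmless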

Be aware, though, that this set contains over 75,000 code points (75,744 as written, including any unassigned positions inside those ranges), so it may not be the most compact or efficient data structure for this.
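If memory matters more than constant-time lookup, one alternative is to keep only the block boundaries and compare against them. The following is a minimal sketch, not part of the original answer: the names HAN_RANGES, is_han and count_han_code_points are hypothetical, and the bounds are copied from Table 12-2 above.

```python
# Inclusive (low, high) bounds copied from Table 12-2.
# HAN_RANGES, is_han and count_han_code_points are illustrative names only.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch lies in one of the ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in HAN_RANGES)


def count_han_code_points():
    """Total number of code points covered by HAN_RANGES."""
    return sum(high - low + 1 for low, high in HAN_RANGES)
```

This stores seven pairs instead of tens of thousands of integers, at the cost of a short linear scan per lookup. Like the set, it treats every position in a block as Han, whether or not a character is assigned there.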

Also, if you insist on using ord() on literal characters, code points above U+FFFF need the \U escape, which takes exactly eight hex digits (the four-digit \u form cannot express them):

>>> ord(u'\U0002F800')
194560
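For comparison, here is a small sketch of the two escape forms, assuming Python 3, where str literals are already Unicode and the u prefix is optional. The variable names are illustrative only.

```python
# Python 3 assumed: every str is Unicode, so no u prefix is needed.
bmp = '\u4e2d'                # \u takes exactly 4 hex digits (up to U+FFFF)
supplementary = '\U0002F800'  # \U takes exactly 8 hex digits (any code point)

assert ord(bmp) == 0x4E2D
assert ord(supplementary) == 0x2F800 == 194560
assert chr(0x2F800) == supplementary  # chr() is the inverse of ord()
assert len(supplementary) == 1        # a single character in Python 3
```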
