Unicode in Python - just UTF-16?


Question

I was happy in my Python world, knowing that I was doing everything in Unicode and encoding as UTF-8 whenever I needed to output something to a user. Then one of my colleagues sent me this article on UTF-8, and it confused me.

The author of the article indicates a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16. He even goes as far as saying directly that Python uses UTF-16 for its internal string representation.

The author also admits to being a Windows lover and developer, and states that the way MS has handled character encodings over the years has left that group the most confused of all, so maybe it is just his own confusion. I don't know...

Can somebody please explain what the state of UTF-16 vs Unicode is in Python? Are they synonymous and, if not, in what way do they differ?

Recommended answer

The internal representation of a Unicode string in Python (versions from 2.2 up to 3.2) depends on whether Python was compiled in wide or narrow mode. Most Python builds are narrow (you can check with sys.maxunicode: it is 65535 on narrow builds and 1114111 on wide builds).
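As a rough illustration of the sys.maxunicode check described above, the sketch below classifies an interpreter by its largest supported code point. The function name describe_unicode_build and its return labels are made up for this example; only sys.maxunicode comes from the answer.

```python
import sys

def describe_unicode_build(maxunicode=None):
    """Classify an interpreter by its largest supported code point.

    Hypothetical helper: the name and the labels are illustrative only.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:
        # Narrow build: 16-bit code units, non-BMP characters need surrogate pairs.
        return "narrow"
    if maxunicode == 0x10FFFF:
        # Wide build (Python <= 3.2) or any Python 3.3+ interpreter.
        return "full range"
    return "unknown"

print(describe_unicode_build())
```

Passing the value explicitly (for example describe_unicode_build(65535)) makes it possible to see both outcomes without having a narrow build at hand.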

With a wide build, strings are internally sequences of 4-byte wide characters, i.e. they use the UTF-32 encoding. Every code point is exactly one wide character in length.

With a narrow build, strings are internally sequences of 2-byte wide characters, using UTF-16. Characters beyond the BMP (code points U+10000 and above) are stored using the usual UTF-16 surrogate pairs:

>>> q = u'\U00010000'
>>> len(q)
2
>>> q[0]
u'\ud800'
>>> q[1]
u'\udc00'
>>> q
u'\U00010000'
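
The two halves seen in the session above follow from the standard UTF-16 surrogate arithmetic. The sketch below reproduces it by hand; the function names to_surrogate_pair and from_surrogate_pair are invented for this example, and the code is written for Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00, matching q[0] and q[1] above
```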

Note that UTF-16 and UCS-2 are not the same. UCS-2 is a fixed-width encoding: every code point is encoded as 2 bytes. Consequently, UCS-2 cannot encode code points beyond the BMP. UTF-16 is a variable-width encoding; code points outside the BMP are encoded using a pair of 16-bit code units, called a surrogate pair.
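
To make the fixed-width versus variable-width distinction concrete, here is a small sketch, assuming Python 3.3 or later (where ord() always sees whole code points). The helpers fits_in_ucs2 and utf16_code_units are hypothetical names; str.encode with the "utf-16-le" codec is the standard library call.

```python
def fits_in_ucs2(text):
    """True if every code point lies in the BMP, so 2 bytes each would suffice."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def utf16_code_units(text):
    """Number of 16-bit code units UTF-16 needs for the text (no BOM)."""
    return len(text.encode("utf-16-le")) // 2

for sample in ("A", "\u20ac", "\U00010000"):
    print(repr(sample), fits_in_ucs2(sample), utf16_code_units(sample))
```

The first two samples are BMP characters and take one code unit each; the third lies outside the BMP, so it fails the UCS-2 check and takes two code units in UTF-16.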

Note that this all changes in 3.3, with the implementation of PEP 393. Now, Unicode strings are represented using characters wide enough to hold the largest code point in the string: 8 bits for strings whose code points all fit in Latin-1 (which includes pure ASCII strings), 16 bits for BMP strings, and 32 bits otherwise. This does away with the wide/narrow divide and also helps reduce memory usage when many ASCII-only strings are used.
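
One way to observe the per-string width on CPython 3.3 or later is to compare the sizes of two strings that differ only in length. This is a sketch under that assumption: sys.getsizeof reports implementation-specific sizes, so other interpreters may give different numbers, and bytes_per_char is a name made up for this example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for strings made only of `ch`.

    Subtracting the size of a 1-character string cancels the fixed
    object header, leaving the cost of n extra characters.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

for ch in ("a", "\u20ac", "\U00010000"):
    print(repr(ch), bytes_per_char(ch))
```

On CPython 3.3+ the three samples are expected to report 1, 2 and 4 bytes per character respectively, and len("\U00010000") is 1 rather than 2.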
