Python 2.7: Strange Unicode behavior
Question
I am experiencing the following behavior in Python 2.7:
>>> a1 = u'\U0001f04f' #1
>>> a2 = u'\ud83c\udc4f' #2
>>> a1 == a2 #3
False
>>> a1.encode('utf8') == a2.encode('utf8') #4
True
>>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8') #5
True
>>> u'\ud83c\udc4f'.encode('utf8') #6
'\xf0\x9f\x81\x8f'
>>> u'\ud83c'.encode('utf8') #7
'\xed\xa0\xbc'
>>> u'\udc4f'.encode('utf8') #8
'\xed\xb1\x8f'
>>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be') #9
u'\U0001f04f'
What is the explanation for this behavior? More specifically:
- I'd expect two strings to be equal if statement #5 is true, yet #3 proves otherwise.
- Encoding both code points together, as in statement #6, yields a different result from encoding them one by one in #7 and #8. It looks like the two code points are treated as one 4-byte code point. But what if I actually want them to be treated as two different code points?
- As you can see from #9, the numbers in a2 are actually a1 encoded using UTF-16-BE. Even though they were specified as Unicode code points using \u inside a Unicode string (!), Python could still somehow arrive at equality in #5. How is that possible?
Nothing makes sense here! What's going on?
Recommended answer
Python 2 is violating the Unicode standard here by permitting you to encode codepoints in the range U+D800 to U+DFFF, at least in a UCS4 build. From Wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
The official UTF-8 standard has no encoding for UTF-16 surrogate codepoints, so Python 3 raises an exception when you try:
>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
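If you really do want the two surrogates treated as separate code points, as the question asks, Python 3 offers the `surrogatepass` error handler. The following is a minimal sketch for Python 3 (not Python 2.7); the helper name `encode_surrogate` is made up for illustration.

```python
# Python 3 sketch: encode lone surrogates explicitly instead of getting an error.
high, low = '\ud83c', '\udc4f'

def encode_surrogate(ch):
    """Hypothetical helper: UTF-8-style bytes for a code point, surrogates allowed."""
    return ch.encode('utf-8', 'surrogatepass')

# The default (strict) handler refuses surrogates, as in the traceback above.
try:
    high.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Each surrogate becomes its own 3-byte sequence, matching #7 and #8.
separate = encode_surrogate(high) + encode_surrogate(low)
print(strict_failed)  # True
print(separate)       # b'\xed\xa0\xbc\xed\xb1\x8f'
```

The resulting bytes are not valid UTF-8 in the strict sense, so other decoders may reject them; they only round-trip reliably through a decoder that is also told to accept surrogates.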
But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:
>>> import sys
>>> sys.maxunicode
65535
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True
because in such a build all non-BMP codepoints are encoded as UTF-16 surrogate pairs (an extension of the UCS2 standard).
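The relationship between U+1F04F and the pair D83C/DC4F is plain arithmetic defined by UTF-16. A short sketch of that mapping follows; the function names are illustrative, not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F04F)
print(hex(high), hex(low))                  # 0xd83c 0xdc4f
print(hex(from_surrogate_pair(high, low)))  # 0x1f04f
```

This is why statement #9 in the question decodes the bytes D8 3C DC 4F to U+1F04F.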
So on a UCS2 build there is no difference between your two values, and combining the pair into the full non-BMP codepoint when encoding is entirely valid, on the assumption that you want to encode U+1F04F and other such codepoints. The UCS4 build just matches that behaviour.
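In Python 3 the two spellings are never equal as strings, so a string that arrived with a surrogate pair in it has to be normalized by hand. One possible approach, sketched here under the assumption that a UTF-16 round trip is acceptable, is shown below; `merge_surrogates` is a hypothetical name.

```python
def merge_surrogates(text):
    """Hypothetical helper: fold UTF-16 surrogate pairs into single code points.

    The text is written out as UTF-16 code units (surrogates allowed), then
    decoded again, at which point each valid high/low pair becomes one
    code point.
    """
    raw = text.encode('utf-16-le', 'surrogatepass')
    return raw.decode('utf-16-le', 'surrogatepass')

a1 = '\U0001f04f'
a2 = '\ud83c\udc4f'
print(a1 == a2)                    # False: one code point vs. two
print(merge_surrogates(a2) == a1)  # True after merging the pair
```

Strings that contain no surrogates pass through unchanged, so the helper can be applied to input of unknown origin.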