Python 2.7: Strange Unicode behavior

Problem description

I am experiencing the following behavior in Python 2.7:

>>> a1 = u'\U0001f04f'  #1
>>> a2 = u'\ud83c\udc4f'  #2
>>> a1 == a2  #3
False
>>> a1.encode('utf8') == a2.encode('utf8')  #4
True
>>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8')  #5
True
>>> u'\ud83c\udc4f'.encode('utf8') #6
'\xf0\x9f\x81\x8f'
>>> u'\ud83c'.encode('utf8')  #7
'\xed\xa0\xbc'
>>> u'\udc4f'.encode('utf8')  #8
'\xed\xb1\x8f'
>>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be')  #9
u'\U0001f04f'

What is the explanation for this behavior? More specifically:

  1. I'd expect two strings to be equal if statement #5 is true, while #3 proves otherwise.
  2. Encoding both code points together like in statement #6 yields results different from when encoded one by one in #7 and #8. Looks like the two code points are treated as one 4-byte code point. But what if I actually want them to be treated as two different code points?
  3. As you can see from #9 the numbers in a2 are actually a1 encoded using UTF-16-BE but although they were specified as Unicode code points using \u inside a Unicode string (!), Python still could somehow get to equality in #5. How could it be possible?

Nothing makes sense here! What's going on?

Recommended answer

Python 2 is violating the Unicode standard here by permitting you to encode codepoints in the range U+D800 to U+DFFF, at least in a UCS4 build. From Wikipedia:

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

The official UTF-8 standard has no encoding for UTF-16 surrogate pair codepoints, so Python 3 raises an exception when you try:

>>> '\ud83c\udc4f'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
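
If you do need to work with surrogate code points in Python 3, the standard `surrogatepass` error handler is one option. The following is only a sketch, assuming Python 3; the variable names are illustrative and not part of the original answer. It shows both treatments asked about in the question: encoding the two surrogates separately, and merging them into the single code point U+1F04F.

```python
# Python 3 sketch; variable names are illustrative only.
pair = '\ud83c\udc4f'  # two surrogate code points, like a2 in the question

# Encode each surrogate on its own, reproducing the bytes from #7 and #8.
separate = pair.encode('utf-8', 'surrogatepass')
print(separate)  # b'\xed\xa0\xbc\xed\xb1\x8f'

# Merge the pair into one code point through a UTF-16 round trip.
merged = pair.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
print(merged == '\U0001f04f')  # True
```

The bytes produced with `surrogatepass` are not valid UTF-8 in the strict sense, so they are best kept away from other systems that expect conforming input.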

But Python 2's Unicode support is a bit more rudimentary, and the behaviour you observe varies with the specific UCS2 / UCS4 build variant; on a UCS2 build, your variables are equal:

>>> import sys
>>> sys.maxunicode
65535
>>> a1 = u'\U0001f04f'
>>> a2 = u'\ud83c\udc4f'
>>> a1 == a2
True

because in such a build all non-BMP codepoints are encoded as UTF-16 surrogate pairs (extending on the UCS2 standard).
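
The relationship between U+1F04F and the pair D83C/DC4F is plain arithmetic defined by UTF-16. As a rough illustration (the helper names below are hypothetical, not part of any library), the mapping can be sketched like this:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F04F)
print(hex(high), hex(low))                  # 0xd83c 0xdc4f
print(hex(from_surrogate_pair(high, low)))  # 0x1f04f
```

This is why `'\xd8\x3c\xdc\x4f'.decode('utf_16_be')` in #9 yields `u'\U0001f04f'`: the four bytes are exactly the high and low surrogate written in big-endian order.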

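So on a UCS2 build there is no difference between your two values, and encoding the pair as the full non-BMP codepoint is entirely valid if you assume the intent is to encode U+1F04F and other such codepoints. The UCS4 build just matches that behaviour when encoding a surrogate pair, which is why #4 and #6 produce the same bytes as encoding a1, while the lone surrogates in #7 and #8 are each encoded as their own 3-byte sequence.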
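
To see where the byte strings in #6, #7 and #8 come from, it can help to apply the generic UTF-8 bit layout by hand. The function below is a hypothetical Python 3 sketch written for illustration; unlike a conforming encoder, it deliberately does not reject surrogates.

```python
def utf8_bytes(code_point):
    """Apply the UTF-8 bit layout to one code point (surrogates allowed)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError('not a Unicode code point')
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The full code point gives the 4-byte result seen in #6.
print(utf8_bytes(0x1F04F))                       # b'\xf0\x9f\x81\x8f'
# Each surrogate on its own gives the 3-byte results seen in #7 and #8.
print(utf8_bytes(0xD83C) + utf8_bytes(0xDC4F))   # b'\xed\xa0\xbc\xed\xb1\x8f'
```

The 4-byte form is what Python 2 emits for both a1 and a2, which is why the comparisons in #4 and #5 come out True even though #3 is False on a UCS4 build.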
