Platform-specific Unicode semantics in Python 2.7


Question

Ubuntu 11.10:

$ python
Python 2.7.2+ (default, Oct  4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
1
>>> ord(x[0])
128077

Windows 7:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
2
>>> ord(x[0])
55357

My Ubuntu experience is with the default interpreter in the distribution. For Windows 7 I downloaded and installed the recommended version linked from python.org. I did not compile either of them myself.

The nature of the difference is clear to me. (On Ubuntu the string is a sequence of code points; on Windows 7 a sequence of UTF-16 code units.) My questions are:

  • Why am I observing this difference in behavior? Is it due to how the interpreter is built, or a difference in dependent system libraries?
  • Is there any way to configure the behavior of the Windows 7 interpreter to agree with the Ubuntu one, that I can do within Eclipse PyDev (my goal)?
  • If I have to rebuild, are there any prebuilt Windows 7 interpreters that behave as Ubuntu above from a reliable source?
  • Are there any workarounds to this issue besides manually counting surrogates in unicode strings on Windows only (blech)?
  • Does this justify a bug report? Is there any chance such a bug report would be addressed in 2.7?

Recommended answer

On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.
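To make the narrow/wide distinction concrete, here is a minimal sketch that checks which kind of build is running and reproduces the 55357 seen in the Windows session. The helper names is_narrow_build and to_surrogate_pair are illustrative and not part of the original answer. Note that on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the check is only informative on older interpreters.

```python
import sys

def is_narrow_build():
    """True on a narrow (UTF-16) build, False on a wide (UCS-4) build."""
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide builds.
    return sys.maxunicode == 0xFFFF

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(is_narrow_build())
print(to_surrogate_pair(0x1F44D))  # (55357, 56397)
```

On a narrow build, x[0] is the high surrogate, which is why ord(x[0]) reports 55357 rather than 128077.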

Windows builds will remain narrow for a while, for three reasons: there have been few requests for wide characters, those requests come mostly from hard-core programmers who are able to build their own Python, and Windows itself is strongly biased towards 16-bit characters.

Python 3.3 will have a flexible string representation (PEP 393), with which you will no longer need to care whether Unicode strings use 16-bit or 32-bit code units.

Until then, you can get the code points from a UTF-16 string with

import struct

def code_points(text):
    # UTF-32LE (no BOM) stores every code point in exactly 4 bytes.
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
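For comparison, the manual surrogate handling that the question mentions can be sketched as follows. This is an illustrative alternative, not the method from the answer; the name code_points_manual is hypothetical. It should give the same result on narrow and wide builds, and it passes unpaired surrogates through unchanged rather than raising an error.

```python
def code_points_manual(text):
    """Return the code points in text, joining UTF-16 surrogate pairs."""
    units = [ord(ch) for ch in text]
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the pair: 10 bits from each surrogate, plus the 0x10000 offset.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP character, already a full code point, or an unpaired surrogate.
            result.append(unit)
            i += 1
    return result

print(code_points_manual(u'\U0001f44d'))  # [128077]
```

With this helper, len(code_points_manual(x)) gives the number of code points regardless of how the interpreter was built.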
