Getting ready to convert from Python 2.x to 3.x


Question

As we all know by now (I hope), Python 3 is slowly beginning to replace Python 2.x. Of course it will be many MANY years before most of the existing code is finally ported, but there are things we can do right now in our version 2.x code to make the switch easier.

Obviously taking a look at what's new in 3.x will be helpful, but what are some things we can do right now to make the upcoming conversion more painless (as well as make it easier to output updates to concurrent versions if needed)? I'm specifically thinking about lines we can start our scripts off with that will make earlier versions of Python more similar to 3.x, though other habits are also welcome.

The most obvious code to add to the top of the script that I can think of is:

from __future__ import division
from __future__ import print_function
try:
    range = xrange
except NameError:
    pass
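A quick sketch of what those two `__future__` imports change; the imports are harmless no-ops on Python 3, so the same code runs on 2.6+ and 3.x:

```python
from __future__ import division, print_function

# With the division import, / is true division even for ints, as in Python 3:
print(7 / 2)    # 3.5
# Floor division is still available explicitly via //:
print(7 // 2)   # 3
# With print_function, print is a function and takes keyword arguments:
print("a", "b", sep="-")
```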

The most obvious habit I can think of is using "{0} {1}!".format("Hello", "World") for string formatting.
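For example, new-style `str.format` calls work identically on 2.6+ and 3.x (explicit positional indices are required before Python 2.7):

```python
# Positional fields; the indices are mandatory on Python 2.6:
greeting = "{0} {1}!".format("Hello", "World")
print(greeting)  # Hello World!

# Named fields also work on both lines:
msg = "{name} runs Python {version}".format(name="Ada", version=3)
print(msg)
```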

Any other lines and good habits to get into?

Answer

The biggest problem that cannot be adequately addressed by micro-level changes and 2to3 is the change of the default string type from bytes to Unicode.

If your code needs to do anything with encodings and byte I/O, it's going to need a bunch of manual effort to convert correctly, so that things that have to be bytes remain bytes, and are decoded appropriately at the right stage. You'll find that some string methods (in particular format()) and library calls require Unicode strings, so you may need extra decode/encode cycles just to use the strings as Unicode even if they're really just bytes.
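A minimal sketch of the "decode at the boundary" pattern, using `io.BytesIO` to stand in for any byte-oriented input such as a file opened in binary mode or a socket:

```python
import io

# Byte-oriented input, e.g. a file opened in binary mode.
raw = io.BytesIO(u'héllo\n'.encode('utf-8'))

# Decode once, at the boundary, then work in Unicode from here on.
text = raw.read().decode('utf-8')
assert text == u'héllo\n'
```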

This is not helped by the fact that some of the Python standard library modules have been crudely converted using 2to3 without proper attention to bytes/unicode/encoding issues, and so themselves make mistakes about what string type is appropriate. Some of this is being thrashed out, but at least from Python 3.0 to 3.2 you will face confusing and potentially buggy behaviour from packages like urllib, email and wsgiref that need to know about byte encodings.

You can ameliorate the problem by being careful every time you write a string literal. Use u'' strings for anything that's inherently character-based, b'' strings for anything that's really bytes, and '' for the ‘default string’ type where it doesn't matter or you need to match a library call's string use requirements.

Unfortunately the b'' syntax was only introduced in Python 2.6, so doing this cuts off users of earlier versions.
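A sketch of that literal discipline (2.6+; note that the `u''` prefix was dropped in Python 3.0–3.2 and reinstated in 3.3, so the same source runs on 2.6+ and 3.3+):

```python
name = u'café'           # inherently character-based: always Unicode
png_magic = b'\x89PNG'   # really bytes: a binary file signature
header = 'content-type'  # 'default' string type: bytes on 2.x, Unicode on 3.x
```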

ETA:

What's the difference?

Oh boy. Well…

A byte contains a value in the range 0–255, and may represent a load of binary data (e.g. the contents of an image) or some text, in which case there has to be a standard chosen for how to map a set of characters into those bytes. Most of these ‘encoding’ standards map the normal ‘ASCII’ character set into the bytes 0–127 in the same way, so it's generally safe to use byte strings for ASCII-only text processing in Python 2.
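This is easy to check: pure-ASCII text gets the same byte values under the common single-byte codecs and under UTF-8:

```python
# 'Hello' is pure ASCII, so every common codec agrees on its bytes:
for codec in ('ascii', 'latin-1', 'cp1252', 'utf-8'):
    assert u'Hello'.encode(codec) == b'Hello'
```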

If you want to use any of the characters outside the ASCII set in a byte string, you're in trouble, because each encoding maps a different set of characters into the remaining byte values 128–255, and most encodings can't map every possible character to bytes. This is the source of all those problems where you load a file from one locale into a Windows app in another locale and all the accented or non-Latin letters change to the wrong ones, making an unreadable mess. (aka ‘mojibake’.)
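Mojibake is easy to reproduce: encode with one codec, then decode with another:

```python
# UTF-8 bytes misread as Latin-1: the classic mojibake.
data = u'é'.encode('utf-8')       # b'\xc3\xa9', two bytes
garbled = data.decode('latin-1')  # each byte read as its own character
assert garbled == u'Ã©'
```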

There are also ‘multibyte’ encodings, which try to fit more characters into the available space by using more than one byte to store each character. These were introduced for East Asian locales, as there are so very many Chinese characters. But there's also UTF-8, a better-designed modern multibyte encoding which can accommodate every character.

If you are working on byte strings in a multibyte encoding—and today you probably will be, because UTF-8 is very widely used; really, no other encoding should be used in a modern application—then you've got even more problems than just keeping track of what encoding you're playing with. len() is going to be telling you the length in bytes, not the length in characters, and if you start indexing and altering the bytes you're very likely to break a multibyte sequence in two, generating an invalid sequence and generally confusing everything.
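Both problems show up immediately with a UTF-8 byte string:

```python
data = u'naïve'.encode('utf-8')

# len() counts bytes, not characters: 'ï' takes two bytes in UTF-8.
assert len(data) == 6
assert len(u'naïve') == 5

# Slicing at a byte offset can cut the two-byte sequence for 'ï' in half,
# leaving an invalid UTF-8 sequence:
truncated = data[:3]
try:
    truncated.decode('utf-8')
except UnicodeDecodeError:
    pass  # we split a multibyte character in two
```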

For this reason, Python 1.6 and later have native Unicode strings (spelled u'something'), where each unit in the string is a character, not a byte. You can len() them, slice them, replace them, regex them, and they'll always behave appropriately. For text processing tasks they are indubitably better, which is why Python 3 makes them the default string type (without having to put a u before the '').
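With Unicode strings, the same operations work in characters, as you would expect:

```python
s = u'naïve'
assert s[2] == u'ï'                  # indexing lands on whole characters
assert s.replace(u'ï', u'i') == u'naive'
assert s.upper() == u'NAÏVE'         # case mapping knows about accents
```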

The catch is that a lot of existing interfaces, such as filenames on OSes other than Windows, or HTTP, or SMTP, are primarily byte-based, with a separate way of specifying the encoding. So when you are dealing with components that need bytes you have to take care to encode your unicode strings to bytes correctly, and in Python 3 you will have to do it explicitly in some places where before you didn't need to.
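A sketch of the output side of that boundary, again with `io.BytesIO` standing in for a socket or a file opened in binary mode:

```python
import io

out = io.BytesIO()  # stands in for any byte-oriented sink
body = u'héllo wörld'

# Encode explicitly at the boundary; Python 3 will not do it implicitly.
out.write(body.encode('utf-8'))
assert out.getvalue() == body.encode('utf-8')
```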

It is an internal implementation detail that Unicode strings take ‘two bytes’ of storage per unit internally. You never get to see that storage; you shouldn't think of it in terms of bytes. The units you are working on are conceptually characters, regardless of how Python chooses to represent them in memory.

…an aside:

This isn't quite true. On ‘narrow builds’ of Python like the Windows build, each unit of a Unicode string is not technically a character, but a UTF-16 ‘code unit’. For the characters in the Basic Multilingual Plane, from 0x0000–0xFFFF you won't notice any difference, but if you're using characters from outside this 16-bit range, those in the ‘astral planes’, you'll find they take two units instead of one, and, again, you risk splitting a character when you slice them.
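A sketch with an astral-plane character. Note that the narrow/wide distinction is specific to Python 2 (and 3.0–3.2); on Python 3.3+, PEP 393 makes every build behave like a wide build, which is what the assertions below assume:

```python
s = u'\U0001F600'  # GRINNING FACE, outside the BMP

# On a wide build (and on any Python 3.3+) it is a single unit:
assert len(s) == 1

# In UTF-16 it needs a surrogate pair: two 16-bit code units, 4 bytes.
# On a narrow Python 2 build, len(s) itself would report 2.
assert len(s.encode('utf-16-be')) == 4
```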

This is pretty bad, and has happened because Windows (and others, such as Java) settled on UTF-16 as an in-memory storage mechanism before Unicode grew beyond the 65,000-character limit. However, use of these extended characters is still pretty rare, and anyone on Windows will be used to them breaking in many applications, so it's likely not critical for you.

On ‘wide builds’, Unicode strings are made of real character ‘code point’ units, so even the extended characters outside of the BMP can be handled consistently and easily. The price to pay for this is efficiency: each string unit takes up four bytes of storage in memory.
