UTF-8 strings in Python 2 and 3


Problem description

The following code works in Python 3:

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

and produces the following output:

Nicholas Gyeney, André  
Writers: Nicholas Gyeney, André

In Python 2.7, though, I get the following error:

Traceback (most recent call last):
  File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 21: ordinal not in range(128)

I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:

b'Nicholas Gyeney, Andr\xc3\xa9'  
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'

So I tried to use the following code:

import sys

if sys.version_info < (3, 0):
    reload(sys)
    sys.setdefaultencoding('utf-8')

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

This makes my code work in all versions of Python, but I have read that using setdefaultencoding is discouraged.

What's the best approach to deal with this issue?

Recommended answer

First, let's assume that you want to support Python 2.7 and 3.5 (2.6 and 3.0 to 3.2 are handled a bit differently).

As you have already read, setdefaultencoding is discouraged and actually not needed in your case.

To write code that handles Unicode text and runs on both Python 2 and 3, you generally only need to specify the string encoding in a few places:

  1. At top of your script, below the shebang with # -*- coding: utf-8 -*- (only if you have string literals with unicode text in your code)
  2. When you read input data (e.g. from a text file or database)
  3. When you output data (again, to a text file or database)
  4. When you define a string literal in code
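Rules 2 and 3 can be illustrated with a small round trip through a file. This is only a sketch: the helper names write_text and read_text and the file name writers.txt are made up for illustration, and it assumes io.open, which accepts an encoding argument on both Python 2.7 and 3.x.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    # Hypothetical helper: encode the text explicitly when writing it out.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def read_text(path, encoding="utf-8"):
    # Hypothetical helper: decode explicitly when reading, so the caller
    # always gets a text (unicode) string back.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


people = [u"Nicholas Gyeney", u"Andr\xe9"]
writers = u", ".join(people)

path = os.path.join(tempfile.mkdtemp(), "writers.txt")
write_text(path, writers)
round_tripped = read_text(path)
```

Because the encoding is named at both boundaries, the text inside the program stays a unicode string and no default encoding is ever consulted.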

Here is how I changed your example by following those rules:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = ['Nicholas Gyeney', 'André']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

print(type(writers))
print(len(writers))

Output (Python 2.7):

<type 'str'>
23

What changed:

  • Specified file encoding at top of file
  • Replaced \xe9 with the actual Unicode character (é)
  • Removed u prefixes

It works nicely in Python 2.7.12 and 3.5.2.

But be warned that removing the u prefixes makes Python 2 use the regular str type (a byte string) instead of unicode (see the output of print(type(writers))). With UTF-8 data it behaves in most places as if it were a unicode string, but checking the text length returns the wrong value. In this example len returns 23, whereas the actual number of characters is 22. This is because the underlying type is str, which counts each byte as a character, and the single character é is encoded as two bytes in UTF-8.
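The difference between the two counts can be reproduced on either Python version by encoding the text explicitly. A minimal sketch (the variable names are illustrative only):

```python
text = u"Nicholas Gyeney, Andr\xe9"   # text string: one item per character
encoded = text.encode("utf-8")        # byte string: one item per byte

char_count = len(text)      # counts characters
byte_count = len(encoded)   # counts bytes; the e-acute takes two in UTF-8
```

Here char_count is 22 and byte_count is 23, which matches the two lengths discussed above.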

In other words, this works fine when you only output the data (as in your example), but not if you want to do string manipulation on the text. In that case you still need to use the u prefix or explicitly convert the data to the unicode type before manipulating it.
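One way to do the explicit conversion is a small helper that decodes byte strings and passes text through unchanged. This is a sketch, not part of any library: the name to_text is hypothetical, and it assumes the incoming bytes really are UTF-8.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings if needed."""
    # On Python 2.7, bytes is an alias for str, so this check covers both
    # Python 2 byte strings and Python 3 bytes objects.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"Andr\xc3\xa9"   # UTF-8 bytes, e.g. as read from a file or socket
name = to_text(raw)     # text string, safe for string manipulation
upper = name.upper()
```

After decoding, operations such as len and upper work on characters rather than on individual bytes.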

So, for anything beyond your simple example, it is better to keep the u prefix. You need it in two places, on the string literals in the list and on the format string:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = [u'Nicholas Gyeney', u'André']
writers = ", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))

print(type(writers))
print(len(writers))

Output (Python 2.7):

<type 'unicode'>
22

Note: the u prefix was removed in Python 3.0 and reintroduced in Python 3.3 for backward compatibility.

A detailed explanation of the intricacies of working with Unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.

Here is an excerpt for the special comment specifying file encoding:

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])

The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports coding. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don’t include such a comment, the default encoding used will be ASCII.

If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, "Advanced Topics". It contains a detailed explanation of working with Unicode text in both generations of Python.

Another detail worth mentioning is that in Python 2, format always returns a byte string (str) if the format string is a byte string, even when the arguments are unicode. The arguments are implicitly encoded with the ASCII codec, which is what raised the UnicodeEncodeError in your original example.

In contrast, old-style formatting with % returns a unicode string if any of the arguments is unicode. So instead of writing this:

print(u"Writers: {}".format(writers))

you could write this, which is not only shorter and prettier, but works in both Python 2 and 3:

print("Writers: %s" % writers)
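If you prefer not to rely on that implicit promotion, marking the format string itself as text makes both styles behave the same way. A small sketch, assuming Python 2.7 or Python 3.3+ (where the u prefix is accepted); the variable names are illustrative:

```python
people = [u"Nicholas Gyeney", u"Andr\xe9"]
writers = u", ".join(people)

# With a text (unicode) format string, both styles return a text string
# on Python 2.7 and on Python 3.3+, with no implicit ASCII encoding.
with_percent = u"Writers: %s" % writers
with_format = u"Writers: {}".format(writers)
```

Both variables hold the same text string, so the choice between the two styles becomes a matter of taste.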
