Python str vs unicode types

Question

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode code points in unicode strings using the escape character \?

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

Results in: á, á

EDIT:

More testing using Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S

Answer

unicode is meant to handle text. Text is a sequence of code points, and a code point's value may not fit in a single byte. Text can be encoded with a specific encoding (e.g. utf-8, latin-1...) to represent it as raw bytes.

Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
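This is also why the shell shows u'\xe1' for ua: the repr displays the code point U+00E1, not latin1-encoded bytes. Latin-1 happens to map every code point below 256 to the byte with the same value, so the two look alike. A minimal sketch, written for Python 3 (where str is the text type and bytes is the byte type) so that it runs on a current interpreter; the variable names are only illustrative:

```python
ua = '\u00e1'  # the text 'á' as a single code point

code_point = ord(ua)                 # integer value of the code point
latin1_bytes = ua.encode('latin-1')  # one byte, same value as the code point
utf8_bytes = ua.encode('utf-8')      # two bytes

print(hex(code_point))   # 0xe1
print(latin1_bytes)      # b'\xe1'
print(utf8_bytes)        # b'\xc3\xa1'
```

The escape in the repr and the latin1 byte coincide only because of how latin1 is defined; for a code point above 255 there is no latin1 byte at all.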

By contrast, str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
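A rough illustration of the text-versus-bytes split, written for Python 3 so that it runs on a current interpreter (read str as Python 2's unicode and bytes as Python 2's str). The variable names are made up for the example:

```python
text = '\u00e0'  # 'à', one code point

as_utf8 = text.encode('utf-8')      # b'\xc3\xa0', two bytes
as_latin1 = text.encode('latin-1')  # b'\xe0', one byte

# Different byte sequences, but the same text once each is decoded
# with the encoding that produced it.
assert as_utf8 != as_latin1
assert as_utf8.decode('utf-8') == as_latin1.decode('latin-1') == text

print(len(text), len(as_utf8), len(as_latin1))  # 1 2 1
```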

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à')   # two bytes here, because the terminal input is utf-8
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
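The last line shows a replacement character because the terminal tries to read the latin1 byte as UTF-8. The same mismatch can be reproduced without a terminal. A sketch in Python 3, assuming UTF-8 is the encoding the reader expects:

```python
latin1_bytes = '\u00e0'.encode('latin-1')  # b'\xe0'

try:
    latin1_bytes.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # a lone 0xE0 byte is not valid UTF-8

# With errors='replace' the bad byte becomes U+FFFD, the replacement character.
lenient = latin1_bytes.decode('utf-8', errors='replace')

print(strict_ok)             # False
print(lenient == '\ufffd')   # True
```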

Note that with str you have lower-level control over the individual bytes of a specific encoded representation, while with unicode you can only work at the code-point level. For example you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What was valid UTF-8 before no longer is. With a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
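The two levels can be compared side by side. This is a sketch in Python 3 (bytes plays the role of Python 2's str, and str the role of unicode), with illustrative variable names:

```python
text = '\u00e0\u00e8\u00ec\u00f2\u00f9'  # 'àèìòù'
data = text.encode('utf-8')

# Byte level: dropping the single byte 0xA8 cuts 'è' in half.
broken = data.replace(b'\xa8', b'')
try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Code-point level: removing 'è' leaves well-formed text.
shorter = text.replace('\u00e8', '')
round_trip = shorter.encode('utf-8').decode('utf-8')

print(still_valid)            # False
print(round_trip == shorter)  # True
print(len(shorter))           # 4
```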
