Python str vs unicode types
Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode code points in unicode strings using the escape character \?
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
EDIT:
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points, each of which may need more than a single byte to store. Text can be encoded in a specific encoding to represent it as raw bytes (e.g. utf-8, latin-1, ...).
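Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want. The u'\xe1' shown by the shell is an escape for the code point U+00E1, not a latin-1 byte; the two look alike only because latin-1 assigns the byte values 0-255 to the code points with the same numbers.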
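To make the code point versus bytes distinction concrete, here is a minimal sketch. It is written with escape sequences so that it should behave the same on Python 2.7 and Python 3.3+; the variable names are only illustrative.

```python
# Illustrative sketch; all variable names are hypothetical.
text = u'\xe1'  # the single code point U+00E1 (the letter a with acute)

# The text object holds a code point, not bytes.
code_point = ord(text)  # 0xE1, i.e. 225

# Encoding is the step that turns the code point into bytes.
as_latin1 = text.encode('latin-1')  # one byte: \xe1
as_utf8 = text.encode('utf-8')      # two bytes: \xc3\xa1

# The resemblance between u'\xe1' and the latin-1 byte disappears
# as soon as a code point above 255 is involved.
euro = u'\u20ac'  # EURO SIGN
try:
    euro.encode('latin-1')
    latin1_can_encode_euro = True
except UnicodeEncodeError:
    latin1_can_encode_euro = False
```

Here the text only becomes bytes when an encoding is chosen, and latin-1 has no byte for the euro sign, whereas utf-8 can encode it.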
By contrast, str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
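A rough sketch of that idea: one piece of text, several byte representations, each of which decodes back to the same text as long as the matching codec is used. The choice of encodings and the names below are just for illustration, and the code is meant to run on both Python 2.7 and Python 3.3+.

```python
# Illustrative sketch; names and the list of encodings are hypothetical.
text = u'\xe1'  # code point U+00E1

encodings = ['utf-8', 'latin-1', 'utf-16-le']

# One text, three different byte sequences.
encoded = dict((name, text.encode(name)) for name in encodings)

# Decoding with the codec that produced the bytes gives the text back.
round_trips = all(encoded[name].decode(name) == text for name in encodings)

# Decoding with the wrong codec does not fail here, but it yields
# different text: two code points instead of one.
mojibake = encoded['utf-8'].decode('latin-1')
```

The bytes alone do not say which encoding produced them, which is why a str in Python 2 cannot be treated as text without knowing its encoding.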
Note: In Python 3, unicode was renamed to str, and there is a new bytes type for a plain sequence of bytes.
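If the same code has to run on both versions, one common approach is to give the two roles version-independent names. The aliases text_type and binary_type below are hypothetical names chosen for this sketch (compatibility libraries use similar ones), not something built into Python.

```python
import sys

# Hypothetical aliases: "text" and "bytes" under one name on both versions.
if sys.version_info[0] >= 3:
    text_type = str        # Python 3: str is text
    binary_type = bytes    # Python 3: bytes is the plain byte sequence
else:
    text_type = unicode    # Python 2: unicode is text (name exists only on Python 2)
    binary_type = str      # Python 2: str is the plain byte sequence

text = u'\xe1'
data = text.encode('utf-8')

is_text = isinstance(text, text_type)
is_bytes = isinstance(data, binary_type)
```

The else branch is never executed on Python 3, so the missing unicode name does not cause an error there.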
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # bytes as typed; with a utf-8 terminal it takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
Note that using str you have lower-level control over the individual bytes of a specific encoded representation, while using unicode you can only control things at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What was valid UTF-8 before isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
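The same contrast can be checked programmatically. The following is a small sketch that uses escape sequences instead of typed characters, so it does not depend on the terminal or source encoding and should run on Python 2.7 and Python 3.3+; the names are illustrative.

```python
# Illustrative sketch; all variable names are hypothetical.
text = u'\xe0\xe8\xec\xf2\xf9'  # five code points: a, e, i, o, u with grave
data = text.encode('utf-8')     # ten bytes, two per character

# Byte level: removing the second byte of the second character
# leaves a lone lead byte, so the result is no longer valid UTF-8.
broken = data.replace(b'\xa8', b'')
try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Code-point level: removing the same character as a whole code point
# gives shorter text that still encodes to valid UTF-8.
shorter = text.replace(u'\xe8', u'')
reencoded = shorter.encode('utf-8')
```

Working on the bytes makes it possible to cut a character in half, while working on the text can only add, remove or replace whole code points.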