What is a unicode string?
Question
What exactly is a unicode string?
What's the difference between a regular string and unicode string?
What is utf-8?
I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?
i18n Strings (Unicode)
> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'
## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1' ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8') ## Convert bytes back to a unicode string
> t == ustring ## It's the same as the original, yay!
True
Files Unicode
import codecs
f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    print(repr(line))  # show the line with non-ASCII characters escaped
Recommended answer
This answer is about Python 2. In Python 3, str is a Unicode string.
Python's str type is a collection of 8-bit characters. The English alphabet can be represented using these 8-bit characters, but symbols such as ±, ♠, Ω and ℑ cannot.
Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
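To make the codepoint versus encoding distinction concrete, here is a minimal sketch. It assumes Python 3 (not the Python 2 this answer targets), where str holds codepoints and bytes holds encoded data. The variable names are made up for illustration.

```python
# Python 3 sketch: one codepoint, several possible byte encodings.
symbol = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

codepoint = ord(symbol)                   # the number Unicode assigns to the symbol
utf8_bytes = symbol.encode("utf-8")       # variable-width encoding
utf16_bytes = symbol.encode("utf-16-be")  # 16-bit units, big-endian, no BOM
utf32_bytes = symbol.encode("utf-32-be")  # 32-bit units, big-endian, no BOM

print(hex(codepoint))   # 0x3a9
print(utf8_bytes)       # b'\xce\xa9'
print(utf16_bytes)      # b'\x03\xa9'
print(utf32_bytes)      # b'\x00\x00\x03\xa9'
```

The codepoint is the same in every case; only the byte representation changes with the encoding.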
UTF-8 is one such encoding. The low codepoints (U+0000 through U+007F, the ASCII range) are encoded using a single byte, and higher codepoints are encoded as sequences of two to four bytes.
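The variable width of UTF-8 can be observed directly. The following is a small sketch assuming Python 3; the sample characters are chosen only for illustration.

```python
# Python 3 sketch: UTF-8 uses more bytes as the codepoint gets larger.
samples = ["A", "\u00f1", "\u2660", "\U0001f600"]  # A, ñ, ♠, and an emoji

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    # Show the codepoint, the number of bytes, and the bytes themselves.
    print("U+%04X -> %d byte(s): %r" % (ord(ch), len(encoded), encoded))
```

Here widths ends up mapping the four characters to 1, 2, 3 and 4 bytes respectively.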
Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are outside the printable ASCII range.
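In Python 3 the interactive interpreter usually prints such characters directly, but the built-in ascii() produces escaped output much like what Python 2 shows. A minimal sketch, assuming Python 3:

```python
# Python 3 sketch: ascii() escapes every non-ASCII character,
# similar to how Python 2 echoes a unicode value.
ustring = "A unicode \u018e string \xf1"

escaped = ascii(ustring)
print(escaped)        # 'A unicode \u018e string \xf1'
print(len(ustring))   # 20 (each escape stands for one character)
```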
The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are each encoded as a sequence of two bytes rather than a single byte.
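The 20 versus 22 count can be checked with a short sketch. It assumes Python 3, where encode() returns a bytes object rather than a Python 2 str; two_byte_chars is a name introduced only for this example.

```python
# Python 3 sketch: compare the number of codepoints with the number of bytes.
ustring = "A unicode \u018e string \xf1"
s = ustring.encode("utf-8")

print(len(ustring))  # 20 codepoints
print(len(s))        # 22 bytes

# Find the characters that need two bytes each in UTF-8.
two_byte_chars = [ch for ch in ustring if len(ch.encode("utf-8")) == 2]
print(two_byte_chars)
```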
When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
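A rough Python 3 counterpart of this step is bytes.decode(). The sketch below assumes Python 3 and also decodes the same bytes as Latin-1, purely to illustrate that the result depends on which encoding is used to parse them.

```python
# Python 3 sketch: decoding is the inverse of encoding.
s = b"A unicode \xc6\x8e string \xc3\xb1"

t = s.decode("utf-8")  # parse the UTF-8 byte sequences back into codepoints
print(t == "A unicode \u018e string \xf1")  # True
print(len(t))                               # 20

# Decoding with a different codec does not fail here, but every byte
# becomes its own character, so the text no longer matches the original.
wrong = s.decode("latin-1")
print(len(wrong))                           # 22
```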
The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
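The same idea can be sketched in Python 3 using the built-in open() with an encoding argument in place of codecs.open(). That substitution, the temporary directory and the file contents are all assumptions made for the example.

```python
# Python 3 sketch: the file stores bytes; the encoding argument tells
# Python how to turn them into text (and back).
import os
import tempfile

text = "A unicode \u018e string \xf1\n"
path = os.path.join(tempfile.mkdtemp(), "foo.txt")  # hypothetical file

# Write the text as UTF-8; newline="\n" keeps line endings unchanged.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# Read it back as text: each line is a str made of codepoints.
with open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

# Read it back in binary mode: this is what is actually on disk.
with open(path, "rb") as f:
    raw = f.read()

print(lines == [text])              # True
print(raw == text.encode("utf-8"))  # True
```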