What is a Unicode string?


Question

What exactly is a Unicode string?

What's the difference between a regular string and a Unicode string?

What is UTF-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    pass  # placeholder so the loop body is valid; process line here

Recommended answer

This answer is about Python 2. In Python 3, str is a Unicode string.

Python's str type (in Python 2) is a collection of 8-bit characters. The English alphabet can be represented using these 8-bit characters, but symbols such as ±, ♠, Ω and ℑ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
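To make the codepoint/encoding distinction concrete, here is a minimal sketch. It assumes Python 3 (where str is already Unicode), and the sample symbols and variable names are my own choices, not part of the original answer.

```python
# Python 3 sketch (assumption: the answer itself targets Python 2).
# Each character has one codepoint; an encoding decides how that
# number is written out as bytes.
symbols = ['A', '\u00b1', '\u03a9', '\u2660']  # A, plus-minus, Greek Omega, spade

# ord() returns the codepoint of a single character.
codepoints = {ch: ord(ch) for ch in symbols}

# The same codepoint produces different bytes under different encodings.
omega_utf8 = '\u03a9'.encode('utf-8')
omega_utf16 = '\u03a9'.encode('utf-16-be')

for ch in symbols:
    print(repr(ch), hex(codepoints[ch]))
print(omega_utf8, omega_utf16)
```

The codepoint is a property of the character; the bytes are a property of the encoding you pick.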

UTF-8 is one such encoding. The low codepoints are encoded using a single byte, and higher codepoints are encoded as sequences of bytes.
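A quick way to see the variable-length behavior is to measure how many bytes a few characters take. This is a Python 3 sketch; the helper name utf8_length and the sample characters are hypothetical, chosen only for illustration.

```python
# Python 3 sketch: UTF-8 uses between 1 and 4 bytes per codepoint.
def utf8_length(ch):
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode('utf-8'))

# Samples with increasing codepoints:
# U+0041 (A), U+00F1, U+018E, U+2660, U+1F600
samples = ['A', '\u00f1', '\u018e', '\u2660', '\U0001F600']
lengths = [utf8_length(ch) for ch in samples]

for ch, n in zip(samples, lengths):
    print('U+%04X takes %d byte(s)' % (ord(ch), n))
```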

Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
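The character count can be checked directly. The sketch below assumes Python 3, where the u prefix is optional and str plays the role of Python 2's unicode type.

```python
# Python 3 sketch: str here corresponds to Python 2's unicode type.
ustring = 'A unicode \u018e string \xf1'

# len() counts codepoints, not bytes.
char_count = len(ustring)

# Indexing returns whole characters, including the non-ASCII ones.
tenth = ustring[10]   # the character written as \u018e
last = ustring[-1]    # the character written as \xf1

print(char_count, hex(ord(tenth)), hex(ord(last)))
```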

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.
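The 22-byte figure can be reproduced as follows. This is a Python 3 sketch, so encode() returns a bytes object rather than a Python 2 str; the variable wide_chars is my own addition.

```python
# Python 3 sketch: encode() turns codepoints into bytes.
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

# 18 ASCII characters take 1 byte each; the other 2 take 2 bytes each.
byte_count = len(s)

# Collect the characters that need more than one byte in UTF-8.
wide_chars = [ch for ch in ustring if len(ch.encode('utf-8')) > 1]

print(s)
print(byte_count, wide_chars)
```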

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
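The round trip, and what happens when the bytes are not valid UTF-8, can be sketched like this. It assumes Python 3, where bytes.decode() or str(s, 'utf-8') stands in for Python 2's unicode(s, 'utf-8'); the truncation case is an illustration I added.

```python
# Python 3 sketch: decoding is the inverse of encoding.
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

t = s.decode('utf-8')            # method form
t_alt = str(s, 'utf-8')          # constructor form, closest to unicode(s, 'utf-8')
round_trip_ok = (t == ustring)

# Cutting a multi-byte sequence in half leaves bytes that cannot be parsed.
try:
    s[:-1].decode('utf-8')
    truncated_failed = False
except UnicodeDecodeError:
    truncated_failed = True

print(round_trip_ok, truncated_failed)
```

Decoding only works if you name the encoding that was actually used to produce the bytes.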

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
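For comparison, here is a sketch of the same idea in Python 3, where the built-in open() accepts an encoding argument and codecs.open() is usually unnecessary. The temporary directory and the file name foo.txt are placeholders so the example can run on its own.

```python
# Python 3 sketch: open() decodes file bytes when given an encoding.
import os
import tempfile

text = 'A unicode \u018e string \xf1\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foo.txt')

    # Write the raw UTF-8 bytes, as another program might have done.
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Read them back as text; each line is a Unicode str.
    with open(path, encoding='utf-8') as f:
        lines = [line for line in f]

print(lines)
```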
