Understanding decode() and encode() with unicode


Question

I can't work out how the decode() and encode() methods work in Python 2.7.

I tried the following statements:

>>> s = u'abcd'
>>> s.encode('utf8')
'abcd'
>>> s.encode('utf16')
'\xff\xfea\x00b\x00c\x00d\x00'
>>> s.encode('utf32')
'\xff\xfe\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00d\x00\x00\x00'

Up to here, I think it's clear: encode() translates a unicode string into the corresponding UTF-8/16/32 byte string.

But when I write:

>>> s.decode('utf8')
u'abcd'
>>> s.decode('utf16')
u'\u6261\u6463'
>>> s.decode('utf32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_32.py", line 11, in decode
    return codecs.utf_32_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

What does decode() mean on a unicode type? Why does the first call (with utf8) work while the others do not? Is it because Python internally stores unicode strings using UTF-8?

One last thing:

>>> s2 = '≈'
>>> s2
'\xe2\x89\x88'

What happens under the hood? '≈' is not an ASCII character, so does Python convert it implicitly using the encoding that sys.getfilesystemencoding() returns?

Answer

You are calling decode on a unicode string. Python helpfully first encodes the string using the default ASCII codec so that you have actual bytes to decode. You cannot decode Unicode data itself; it is already decoded.

For UTF-32, that decoding fails because the bytes are not valid UTF-32 data. The bytestring 'abcd' is decodable as UTF-8 because ASCII is a subset of UTF-8, so encoding to ASCII and then decoding as UTF-8 gives back the same text. Decoding as UTF-16 worked only by chance: you provided 4 bytes with hex values 0x61, 0x62, 0x63 and 0x64 (the ASCII values for the characters abcd), and those bytes can be decoded as little-endian UTF-16 into the two code points \u6261 and \u6463. Read as UTF-32, the same 4 bytes form a single value that lies above the highest Unicode code point (0x10FFFF), so there is no valid decoding.
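To make the hidden step visible, here is a minimal sketch written for Python 3, where the implicit ASCII encode no longer exists and has to be spelled out. The variable names are mine, and I use the explicit little-endian codecs (utf-16-le, utf-32-le) so the result does not depend on the machine's byte order; that matches the output in your session but is an assumption about your platform.

```python
# Python 3 sketch: spell out what Python 2 did implicitly for u'abcd'.decode(...)
text = 'abcd'

# Step 1: the implicit encode with the default ASCII codec.
raw = text.encode('ascii')            # the four bytes 0x61 0x62 0x63 0x64

# Step 2: the decode that was actually requested.
as_utf8 = raw.decode('utf-8')         # ASCII is a subset of UTF-8
as_utf16 = raw.decode('utf-16-le')    # two 16-bit code units: 0x6261 and 0x6463

# As UTF-32, the four bytes form one single little-endian value.
code_point = int.from_bytes(raw, 'little')

try:
    raw.decode('utf-32-le')
    utf32_failed = False
except UnicodeDecodeError:
    utf32_failed = True

print(repr(as_utf8), ascii(as_utf16), hex(code_point), utf32_failed)
```

The value 0x64636261 is far above 0x10FFFF, which is why the UTF-32 codec rejects it while the other two codecs accept the same bytes.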

If s contained data that cannot be encoded to ASCII first, you would get a UnicodeEncodeError exception instead; note the Encode in that name:

>>> u'åßç'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

because the implicit encoding to a bytestring failed.
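The failing step can be reproduced on its own. This is a small Python 3 sketch, not the Python 2 internals: it performs the ASCII encode explicitly and records which exception comes back. The names error_name and failed_range are only for illustration.

```python
# Python 3 sketch of the implicit ASCII encode failing.
text = '\xe5\xdf\xe7'   # the characters å, ß, ç written as escapes

error_name = None
failed_range = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    # start/end delimit the slice of `text` that could not be encoded
    failed_range = (exc.start, exc.end)

print(error_name, failed_range)
```

The exception is raised before any decoding is attempted, which is why the traceback talks about encoding even though decode() was called.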

In Python 3, unicode objects have been renamed to str, and the str.decode() method has been removed from the type to prevent this kind of confusion. Only str.encode() remains. The Python 2 str type has been replaced by the bytes type, which only has a bytes.decode() method.
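You can check that split yourself in a Python 3 interpreter. A quick sketch using only hasattr() on the built-in types:

```python
# Python 3: text and bytes each keep only one direction of the conversion.
text_methods = (hasattr(str, 'encode'), hasattr(str, 'decode'))
bytes_methods = (hasattr(bytes, 'encode'), hasattr(bytes, 'decode'))

print(text_methods)    # (True, False)
print(bytes_methods)   # (False, True)

# The only possible round trip: str -> bytes -> str
round_trip = 'abcd'.encode('utf-8').decode('utf-8')
print(round_trip == 'abcd')
```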

Your second example shows that you are using the Python interpreter interactively in a terminal or console. Python received your input from the terminal as UTF-8 bytes and stored those bytes in a bytestring. Had you used a unicode literal, Python would have automatically decoded those bytes using the encoding declared for your terminal; you can introspect sys.stdin.encoding to see what Python detected:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> s = '≈'
>>> s
'\xe2\x89\x88'
>>> s = u'≈'
>>> s
u'\u2248'
>>> print s
≈

Vice versa, when printing, the sys.stdout.encoding codec is used to auto-encode Unicode strings to the codec used by your terminal, which then interprets those bytes again to display the right glyphs on your screen.
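Both directions can also be done by hand. The following is a Python 3 sketch that assumes a UTF-8 terminal, as in the session above; the encoding is hard-coded in a variable of my own naming rather than read from sys.stdin.encoding, so the snippet behaves the same when it is not attached to a terminal.

```python
# Python 3 sketch, assuming a UTF-8 terminal as in the session above.
terminal_encoding = 'utf-8'   # assumption; a real session reports it via sys.stdin.encoding

typed_bytes = b'\xe2\x89\x88'                      # the bytes the terminal sent for one character
decoded = typed_bytes.decode(terminal_encoding)    # what a unicode literal does on input
encoded = decoded.encode(terminal_encoding)        # what print does on output

print(ascii(decoded), len(decoded), encoded == typed_bytes)
```

Three bytes come in, one code point (U+2248) is stored, and the same three bytes go back out to the terminal.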

If you are not working in the Python interactive interpreter but are instead working with a Python source file, the codec to use is instead determined by the PEP 263 source code encoding declaration, as Python 2 otherwise defaults to decoding bytes as ASCII.
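For reference, such a declaration looks like the sketch below. It assumes the file really is saved as UTF-8; the variable name is only for illustration.

```python
# -*- coding: utf-8 -*-
# The declaration above must appear on the first or second line of the file.
# Python 2 needs it for non-ASCII source; Python 3 already defaults to UTF-8.
approx = u'≈'
print(len(approx), approx == u'\u2248')
```

The declared codec must match the encoding the editor actually used when saving the file, otherwise the literal is decoded into the wrong characters.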

sys.getfilesystemencoding() has nothing to do with all this; it tells you what Python thinks your filesystem metadata is encoded with, e.g. the filenames in directories. That value is used when you pass unicode paths to filesystem-related calls like os.listdir().
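A short sketch of where that value matters, written for Python 3, where the text type is str rather than unicode. It lists the current directory twice, once with a text path and once with a bytes path; which directory you list is arbitrary here.

```python
import os
import sys

# The codec Python uses to convert filenames between bytes and text.
fs_encoding = sys.getfilesystemencoding()

# A text path yields text names; a bytes path yields the raw bytes names.
text_names = os.listdir('.')
byte_names = os.listdir(b'.')

print(fs_encoding)
print(all(isinstance(name, str) for name in text_names))
print(all(isinstance(name, bytes) for name in byte_names))
```

The type of the path you pass in decides whether the filesystem encoding is applied to the results.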
