Is u'string' the same as 'string'.decode('XXX')


Question

Although the title is a question, the short answer is apparently no: I've tried it in the shell. The real question is why. PS: the string contains non-ASCII characters such as Chinese, and XXX is the string's current encoding.

>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4', while the second one is u'\u4e2d\u6587'

The example is above. I am using Windows in Simplified Chinese. The default encoding is GBK, and the Python shell uses GBK as well. Yet the two unicode objects compare as unequal.

Update

>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文

>>> b = u'中文'
>>> print b
ÖÐÎÄ


Recommended answer

Yes, str.decode() usually returns a unicode string, provided the codec can successfully decode the bytes. But the two values only represent the same text if the correct codec is used.

Your sample text is not using the right codec; you have text that is GBK-encoded but decoded as Latin-1:

>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'

The values are indeed not equal, because they are not the same text.

Again, it is important that you use the right codec; a different codec will produce a very different result:

>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ

Here the GBK-encoded bytes were decoded as Latin-1 rather than as GBK. The decoding succeeded, but the resulting text is not readable.
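As a minimal sketch of the same point, the round trip can be written out step by step. This assumes Python 3, where bytes.decode() plays the role that str.decode() plays in the Python 2 sessions above; the variable names are illustrative only.

```python
# Python 3 sketch; bytes.decode() here corresponds to Python 2's str.decode().
text = u'\u4e2d\u6587'                 # the two characters from the question

gbk_bytes = text.encode('gbk')         # the bytes a GBK console would send
right = gbk_bytes.decode('gbk')        # correct codec: the original text
wrong = gbk_bytes.decode('latin-1')    # wrong codec: one character per byte

print(right == text)    # True
print(wrong == text)    # False
print(ascii(wrong))     # '\xd6\xd0\xce\xc4'

# Latin-1 maps every byte to the code point with the same value, so this
# particular mistake loses nothing and can be undone by reversing it.
repaired = wrong.encode('latin-1').decode('gbk')
print(repaired == text)  # True
```

The last step only works because Latin-1 accepts every possible byte; with most other wrong codecs the decode either fails outright or loses information.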

Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec it uses, it is able to decode those bytes into the u'....' Unicode literal value. When printing a unicode result such as encoded.decode('utf8'), Python once more auto-encodes the data to fit my terminal encoding.

To see what codec Python detected, print sys.stdin.encoding:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
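When the check needs to run from a script rather than the shell, something like the following may help. It is only a sketch: describe_encodings is a hypothetical helper name, and the values it reports depend entirely on the environment (either stream encoding can be None when input or output is redirected).

```python
import locale
import sys


def describe_encodings():
    """Collect the codecs Python reports for console I/O (values may be None)."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # The locale's preferred encoding, without changing the locale.
        'locale': locale.getpreferredencoding(False),
    }


encodings = describe_encodings()
for name in sorted(encodings):
    print('%s: %s' % (name, encodings[name]))
```

On the asker's system one would expect GBK-related names here (for example a cp936 alias), but that is an assumption about that machine, not something the code guarantees.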

Similar decisions have to be made when dealing with different sources of text. Reading string literals from a source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec declaration at the top of the file.
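The two options for source files can be compared in a small sketch. It assumes Python 3 and simulates the file contents as byte strings passed to compile(), which honours the coding declaration when given bytes; run, source_gbk and source_ascii are made-up names for illustration.

```python
# Option 1: the file contains raw non-ASCII characters, saved as GBK, with a
# coding declaration on the first line telling Python how to decode the file.
source_gbk = u"# -*- coding: gbk -*-\ntext = u'\u4e2d\u6587'\n".encode('gbk')

# Option 2: the file is pure ASCII and spells the characters as escape codes.
source_ascii = b"text = u'\\u4e2d\\u6587'\n"


def run(source_bytes):
    """Compile and execute source bytes, returning the 'text' it defines."""
    namespace = {}
    exec(compile(source_bytes, '<demo>', 'exec'), namespace)
    return namespace['text']


from_declared = run(source_gbk)
from_escapes = run(source_ascii)
print(from_declared == from_escapes)  # True
```

Both routes produce the same two-character string; the difference is only in who is responsible for stating the codec.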

I urge you to read:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
  • Python Unicode HOWTO
  • Pragmatic Unicode by Ned Batchelder

to gain a more complete understanding of how Unicode works and how Python handles it.
