Python Unicode strings and the Python interactive interpreter

Question

I'm trying to understand how Python 2.5 deals with unicode strings. By now I think I have a good grasp of how I'm supposed to handle them in code, but I don't fully understand what goes on behind the scenes, particularly when you type strings at the interpreter's prompt.

Python before 3.0 has two string types: str (byte strings) and unicode, both derived from basestring. The default type for string literals is str.

str objects have no notion of their actual encoding; they are just bytes. Either you encoded a unicode string yourself and therefore know which encoding the bytes are in, or you read a stream of bytes whose encoding you (ideally) know beforehand. You can guess the encoding of a byte string when you don't know it, but there is no reliable way to determine it. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
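As a minimal sketch of "decode early, encode late": the snippet below uses the `b"..."` and `u"..."` prefixes, so it assumes Python 2.6+ or Python 3.3+ rather than 2.5 itself (which has no `b` prefix). The variable names and the choice of UTF-8 as the input encoding are illustrative assumptions, not something the interpreter decides for you.

```python
# Bytes as they might arrive from a file or a socket.
# Assumption for this sketch: we know they are UTF-8.
raw = b"La ca\xc3\xb1a de Espa\xc3\xb1a"

# Decode early: turn bytes into text as soon as they enter the program.
text = raw.decode("utf-8")

# Work with text (unicode) everywhere in between.
shouted = text.upper()

# Encode late: go back to bytes only at the output boundary.
out = shouted.encode("utf-8")

# The byte string is longer than the text because each ñ takes two bytes in UTF-8.
byte_count = len(raw)
char_count = len(text)
```

Here `byte_count` and `char_count` differ because `len` counts bytes on the raw data but characters on the decoded text.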

That much is fine. But are strings typed into the interpreter encoded for you behind your back? Provided that my understanding of strings in Python is correct, what method or setting does Python use to make this decision?

The source of my confusion is the differing results I get when I try the same thing on my system's python installation, and on my editor's embedded python console.

 # Editor (Sublime Text)
 >>> s = "La caña de España"
 >>> s
 'La ca\xc3\xb1a de Espa\xc3\xb1a'
 >>> s.decode("utf-8")
 u'La ca\xf1a de Espa\xf1a'
 >>> sys.getdefaultencoding()
 'ascii'

 # Windows python interpreter
 >>> import sys
 >>> s = "La caña de España"
 >>> s
 'La ca\xa4a de Espa\xa4a'
 >>> s.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
 >>> sys.getdefaultencoding()
 'ascii'

Answer

Let me expand on Ignacio's reply. In both cases there is an extra layer between you and Python: Sublime Text in one case and cmd.exe in the other. The difference in behaviour you see is caused not by Python but by the different encodings used by Sublime Text (UTF-8, it seems) and cmd.exe (cp437).

So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends '\xa4'. (I'm simplifying here, omitting details that are not relevant to the question.)
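The two byte sequences can be reproduced without either console by encoding the same character with each codec. This is only a sketch, assuming Python 2.6+ or Python 3.3+ (for the `b"..."` and `u"..."` prefixes and the `except ... as` syntax); the variable names are made up for illustration.

```python
n_tilde = u"\xf1"  # the character ñ (U+00F1)

# What a UTF-8 front end (such as the Sublime Text console above) would hand to Python.
utf8_bytes = n_tilde.encode("utf-8")

# What a cp437 front end (such as cmd.exe above) would hand to Python.
cp437_bytes = n_tilde.encode("cp437")

# Decoding the cp437 byte as UTF-8 fails, as in the traceback in the question.
try:
    cp437_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the codec that actually produced the bytes round-trips cleanly.
round_trip = cp437_bytes.decode("cp437")
```

The same character therefore ends up as two bytes in one session and one byte in the other, and only the matching codec can turn the bytes back into ñ.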

Still, Python knows which encoding the console uses. From cmd.exe you'll probably get something like:

>>> import sys
>>> sys.stdin.encoding
'cp437'

whereas within Sublime Text you'll get something like:

>>> import sys
>>> sys.stdin.encoding
'utf-8'
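One way to put `sys.stdin.encoding` to use is to decode console bytes with whatever encoding the console reports instead of hard-coding "utf-8". The helper below is a hypothetical sketch (the name `decode_console_bytes` and the UTF-8 fallback are assumptions of this example, not part of Python), written for Python 2.6+ or Python 3.3+.

```python
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode bytes typed at a console into text.

    Hypothetical helper: if no encoding is given, use the one reported by
    sys.stdin. That attribute may be missing or None (for example when
    input is redirected), so fall back to UTF-8 as an assumed default.
    """
    if encoding is None:
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


# The bytes from the two sessions in the question, decoded with the
# encoding each console would report.
from_cmd = decode_console_bytes(b"La ca\xa4a de Espa\xa4a", "cp437")
from_sublime = decode_console_bytes(b"La ca\xc3\xb1a de Espa\xc3\xb1a", "utf-8")
```

With the right encoding supplied, both sessions yield the same unicode string, which is why `s.decode(sys.stdin.encoding)` would be a safer choice at the prompt than `s.decode("utf-8")`.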
