Why does Python print unicode characters when the default encoding is ASCII?


From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.

Solution

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

By trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it's been started from. If it can't find a proper encoding from the environment, only then does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let's for a moment exit the Python shell and set bash's environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii, you should get a nice error message:

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
in position 0: ordinal not in range(128)
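In Python 3 the same failure can be reproduced with an explicit encode (a sketch for comparison; Python 3 separates str and bytes, so print no longer encodes to ascii implicitly):

```python
# Encoding a non-ASCII character with the ascii codec fails,
# just as Python 2's implicit encode-on-print did above.
try:
    "\xe9".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # the ascii codec rejects ordinals >= 128
```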


Let's exit Python and discard the bash shell.

We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we'll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding, it only changes the way the terminal itself will decode output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é", and so that's what the terminal displays.

(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it's "UTF-8". After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long; the latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169), and that yields 2 characters: Ã and ©.

(3) python encodes the unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out the latin-1 code point range is 0-255 and points to the exact same characters as Unicode within that range. Therefore, Unicode code points in that range will yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 will also yield the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields "é", and that's what's displayed.
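The three cases above can be checked directly with Python 3's explicit encode/decode API (a sketch; the byte values are the same as in the Python 2 session):

```python
# Case (2): encoding U+00E9 with UTF-8 yields two bytes, 0xc3 0xa9.
assert "\xe9".encode("utf-8") == b"\xc3\xa9"

# A latin-1 terminal decodes those two bytes one at a time,
# yielding 'Ã' (0xc3) followed by '©' (0xa9).
assert b"\xc3\xa9".decode("latin-1") == "\xc3\xa9"

# Case (3): latin-1 maps code points 0-255 straight to bytes,
# so U+00E9 encodes to the single byte 0xe9.
assert "\xe9".encode("latin-1") == b"\xe9"
```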

Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn't understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever's in sys.stdout.encoding. Still "UTF-8". The resulting binary string is '\xc3\xa9'. The terminal receives the stream and attempts to decode 0xc3a9, also using UTF-8. It yields back the code value 0xe9 (233), which on the Unicode character map points to the symbol "é". The terminal displays "é".

(6) python encodes the unicode string with latin-1, yielding a binary string with the same value, '\xe9'. Again, for the terminal this is pretty much the same as case (4).

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell's environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal's encoding is independent from the shell's.
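Cases (4)-(6) can likewise be verified: UTF-8 rejects the lone byte 0xe9, while the two-byte sequence decodes back to 'é' (again a Python 3 sketch of the byte-level behaviour):

```python
# Case (5): a UTF-8 terminal decodes 0xc3 0xa9 back to code point U+00E9.
assert b"\xc3\xa9".decode("utf-8") == "\xe9"

# Cases (4) and (6): a bare 0xe9 byte is not valid UTF-8,
# which is why the terminal printed nothing for it.
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc)
```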


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xc8 points to 'È' in latin-1 and Unicode, and 0xe9 points to 'é' in latin-1 and Unicode.
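This overlap of the tables is easy to confirm (a small Python 3 sketch):

```python
# ASCII, latin-1 and Unicode agree on code points 0-127;
# latin-1 and Unicode agree on 0-255.
assert ord("A") == 0x41
assert ord("\xe9") == 0xE9                  # 'é'
# For latin-1, the encoded byte IS the code point value.
assert "\xe9".encode("latin-1")[0] == 0xE9
```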

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (utf7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point's value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements; the most economical ones don't cover all unicode code points. For example, ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range ('B', which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.
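The per-character storage costs can be measured directly (a Python 3 sketch; the explicit `-le` variants are used so no byte-order mark is prepended):

```python
# Storage cost of 'B' (code point 65) under different Unicode encodings.
assert len("B".encode("utf-8")) == 1       # 1 byte in the ascii range
assert len("B".encode("utf-16-le")) == 2   # 2 bytes even for ascii characters
assert len("B".encode("utf-32-le")) == 4   # always 4 bytes
```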

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)

  • the x's show the actual space reserved to "store" the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn't change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)
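The identity property in the ascii range can be checked like so (a Python 3 sketch):

```python
# In the ascii range, UTF-8 is an identity transform:
# code point 0x42 ('B') encodes to the single byte 0x42.
assert "B".encode("utf-8") == b"\x42" == b"B"
assert format(ord("B"), "08b") == "01000010"  # the 0xxx xxxx pattern above
```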

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)

  • the leading bits '110' indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas '1110' indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner '10' flag bits are used to signal the beginning of an inner byte.
  • again, the x's mark the space where the Unicode code point value is stored after encoding.

e.g. the Unicode code point for 'é' is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, and that it should therefore be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point, after UTF-8 encoding, becomes 0xc3a9. Which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
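The bit packing above can be reproduced arithmetically (a sketch of the 2-byte UTF-8 pattern only, not a full encoder):

```python
def utf8_two_byte(cp):
    """Pack a code point in range 128-2047 into the 110xxxxx 10xxxxxx pattern."""
    assert 0x80 <= cp <= 0x7FF
    byte1 = 0b11000000 | (cp >> 6)          # leading '110' flag + top 5 bits
    byte2 = 0b10000000 | (cp & 0b00111111)  # continuation '10' flag + low 6 bits
    return bytes([byte1, byte2])

# 0xe9 ('é') becomes 0xc3 0xa9, matching Python's own UTF-8 encoder.
assert utf8_two_byte(0xE9) == b"\xc3\xa9" == "\xe9".encode("utf-8")
```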
