Why does Python print unicode characters when the default encoding is ASCII?

Question

From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.

Answer

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

When you print a unicode string such as u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python picks up this setting from the environment it was started from. Only if it can't find a proper encoding in the environment does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, Python picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let's for a moment exit the Python shell and set bash's environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output a unicode character outside the ascii range, you should get a nice error message:

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)
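
The implicit step can be reproduced by encoding explicitly. The following is a minimal sketch, assuming Python 3 (where str plays the role of Python 2's unicode type); encode_for_stream is a hypothetical helper named here for illustration, not something print actually calls:

```python
# Python 3 sketch: str here corresponds to Python 2's unicode type.
text = "\xe9"  # the single code point U+00E9 (é)

def encode_for_stream(value, encoding):
    """Hypothetical helper: mimic the encode step that print performs implicitly."""
    return value.encode(encoding)

# UTF-8 can represent every code point, so this succeeds.
utf8_bytes = encode_for_stream(text, "utf-8")

# ASCII only covers 0-127, so encoding 0xe9 raises UnicodeEncodeError.
try:
    encode_for_stream(text, "ascii")
    ascii_failed = False
    ascii_error = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    ascii_error = str(exc)

print(utf8_bytes)
print(ascii_failed, ascii_error)
```

The failure has nothing to do with the terminal: it happens inside Python, before any byte is written.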


Let's exit Python and discard the bash shell.

We'll now observe what happens after Python outputs strings. For this, we first start a bash shell within a graphical terminal (I use Gnome Terminal) and set the terminal to decode output with ISO-8859-1, also known as latin-1 (graphical terminals usually have a Set Character Encoding option in one of their dropdown menus). Note that this doesn't change the shell environment's encoding; it only changes the way the terminal itself decodes the output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently of the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) Python outputs the binary string as is. The terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 (233) yields the character "é", so that's what the terminal displays.

(2) Python implicitly encodes the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance "UTF-8". After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see the explanation further down). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 only covers values 0 to 255, so it decodes a stream 1 byte at a time. 0xc3a9 is 2 bytes long, so the latin-1 decoder interprets it as 0xc3 (195) and 0xa9 (169), which yields 2 characters: Ã and ©.

(3) Python encodes the unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that latin-1 code points range from 0 to 255 and point to exactly the same characters as Unicode within that range. Unicode code points in that range therefore keep the same value when encoded in latin-1, so u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it against the latin-1 character map. Just like case (1), it yields "é", and that's what's displayed.
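
Cases (1) to (3) can be simulated without touching any terminal settings. This is a rough sketch, assuming Python 3; terminal_shows is a hypothetical stand-in for the terminal's decoding step, not a real terminal API:

```python
def terminal_shows(raw_bytes, terminal_encoding):
    """Hypothetical stand-in for a terminal: decode the bytes it receives."""
    return raw_bytes.decode(terminal_encoding, errors="replace")

# (1) raw byte 0xe9 written as is, terminal decodes with latin-1
case_1 = terminal_shows(b"\xe9", "latin-1")

# (2) Python encodes é as UTF-8, terminal still decodes with latin-1
case_2 = terminal_shows("\xe9".encode("utf-8"), "latin-1")

# (3) Python encodes é as latin-1, terminal decodes with latin-1
case_3 = terminal_shows("\xe9".encode("latin-1"), "latin-1")

print(case_1, case_2, case_3)
```

Only case (2) mismatches the encoder and the decoder, which is why it is the only one that produces two characters.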

Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) Python outputs the binary string as is. The terminal attempts to decode that stream with UTF-8, but a lone 0xe9 byte is not a valid UTF-8 sequence: its leading bits 1110 announce a 3-byte sequence that never arrives (see the explanation further down). The terminal is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) Python implicitly encodes the Unicode string with whatever is in sys.stdout.encoding, which is still "UTF-8". The resulting binary string is '\xc3\xa9'. The terminal receives the stream and decodes 0xc3a9, this time also using UTF-8. That yields back the code point 0xe9 (233), which on the Unicode character map points to the symbol "é". The terminal displays "é".

(6) Python encodes the unicode string with latin-1, which yields a binary string with the same value, '\xe9'. For the terminal this is exactly the same byte as in case (4), so the result is the same.
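
The same simulation works for cases (4) to (6). Again a sketch under assumptions: Python 3, and a hypothetical terminal_shows helper that substitutes U+FFFD for undecodable bytes. A real terminal may instead print nothing, as observed above, or show some other placeholder glyph:

```python
def terminal_shows(raw_bytes, terminal_encoding):
    """Hypothetical stand-in for a terminal: undecodable bytes become U+FFFD."""
    return raw_bytes.decode(terminal_encoding, errors="replace")

case_4 = terminal_shows(b"\xe9", "utf-8")                    # lone 0xe9: invalid UTF-8
case_5 = terminal_shows("\xe9".encode("utf-8"), "utf-8")     # 0xc3 0xa9: valid UTF-8
case_6 = terminal_shows("\xe9".encode("latin-1"), "utf-8")   # same byte as case 4

# A strict decoder refuses the lone byte outright.
try:
    b"\xe9".decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(repr(case_4), case_5, repr(case_6), strict_ok)
```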

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them correctly if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell's environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal's encoding is independent of the shell's.


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. For example, by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xdc points to 'Ü' in latin-1 and Unicode, and 0xe9 points to 'é' in latin-1 and Unicode.
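
That overlap is easy to check exhaustively. A small sketch, assuming Python 3, where chr(n) returns the character at Unicode code point n:

```python
# Every ASCII byte decodes to the Unicode character with the same number.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

# Every latin-1 byte decodes to the Unicode character with the same number.
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Code points of the two characters used throughout this answer.
code_point_A = hex(ord("A"))
code_point_e_acute = hex(ord("\xe9"))

print(ascii_matches, latin1_matches, code_point_A, code_point_e_acute)
```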

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward approach would be to simply use a code point's value in the Unicode map as its electronic form, but Unicode's code space has room for over a million code points (0 to 0x10FFFF), which means that some of them require 3 bytes to be expressed. To work efficiently with text, such a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements. The most economical ones don't cover all unicode code points: ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range ('B', which is 66, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful, as it stores all characters in 4 bytes.
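
The space trade-off can be measured directly. A minimal sketch, assuming Python 3; encoded_size is a hypothetical helper, and the "-le" codec variants are used so that no byte order mark is added to the count:

```python
def encoded_size(char, encoding):
    """Hypothetical helper: number of bytes one character takes in an encoding."""
    return len(char.encode(encoding))

# 'B' is in the ascii range, é (0xe9) is not.
sizes_B = {enc: encoded_size("B", enc)
           for enc in ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le")}
sizes_e_acute = {enc: encoded_size("\xe9", enc)
                 for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")}

print(sizes_B)
print(sizes_e_acute)
```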

UTF-8 cleverly resolves this dilemma with a scheme that stores code points in a variable number of bytes. As part of its encoding strategy, UTF-8 laces code points with flag bits that tell decoders how many bytes each code point occupies and where its boundaries are.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)

  • the x's show the actual space reserved to "store" the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn't change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)

  • the leading bits '110' indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas '1110' indicates 3 bytes and '11110' indicates 4 bytes (the longest form in current UTF-8).
  • the inner '10' flag bits signal that a byte is an inner (continuation) byte rather than the start of a new code point.
  • again, the x's mark the space where the Unicode code point value is stored after encoding.
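
A decoder can read those flag bits off the first byte with a few bit masks. The following is a sketch only, assuming Python 3; sequence_length is a hypothetical name, and the function inspects the flag bits without validating the rest of the sequence:

```python
def sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence occupies, judging by its first byte."""
    if (first_byte & 0b10000000) == 0b00000000:   # 0xxx xxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:   # 110x xxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:   # 1110 xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:   # 1111 0xxx
        return 4
    # 10xx xxxx is an inner byte, not the start of a code point.
    raise ValueError("not a valid leading byte: 0x%02x" % first_byte)

print(sequence_length(0x42), sequence_length(0xc3))
```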

e.g. 'é' Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point becomes 0xc3a9 after UTF-8 encoding, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
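
The bit layout above translates into a few shifts and masks. This is an illustrative sketch, assuming Python 3, and not how Python's own codec is implemented; utf8_encode_code_point is a hypothetical name, the sketch stops at the 3-byte form shown in the tables, and it does not reject surrogate code points the way the built-in codec does:

```python
def utf8_encode_code_point(cp):
    """Encode one code point (0 to 65535) by hand, following the bit templates."""
    if cp < 0x80:
        # 0xxx xxxx: the value is stored unchanged.
        return bytes([cp])
    if cp < 0x800:
        # 110x xxxx 10xx xxxx: top 5 bits, then low 6 bits.
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b00111111)])
    if cp < 0x10000:
        # 1110 xxxx 10xx xxxx 10xx xxxx: top 4 bits, middle 6, low 6.
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b00111111),
                      0b10000000 | (cp & 0b00111111)])
    raise ValueError("code points above 65535 are outside this sketch")

encoded = utf8_encode_code_point(0xe9)
print(encoded, encoded == "\xe9".encode("utf-8"))
```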
