Encoding in Python 2.7


Question



I have some questions about encoding in Python 2.7.

1. The Python code is as below:

#s = u"严"
s = u'\u4e25'
print 's is:', s
print 'len of s is:', len(s)
s1 = "a" + s
print 's1 is:', s1
print 'len of s1 is:', len(s1)

the output is:

s is: 严
len of s is: 1
s1 is: a严
len of s1 is: 2

I am confused why the len of s is 1; how could 4e25 be stored in 1 byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the length of the Unicode string s 1?

2. (1) Create a file named a.py with Notepad++ (Windows 7) and set the file's encoding to ANSI; the code in a.py is as below:

# -*- encoding:utf-8 -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 严
type of s: <type 'str'>

(2) Create a file named b.py with Notepad++ (Windows 7) and set the file's encoding to UTF-8; the code in b.py is as below:

# -*- encoding:gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\b.py", line 1
SyntaxError: encoding problem: utf-8

(3) Change file b.py as below (the file's encoding is still UTF-8):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 涓
type of s: <type 'str'>

(4) Change file a.py as below (the file's encoding is still ANSI):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\a1.py", line 3
SyntaxError: Non-ASCII character '\xd1' in file D:\pyws\code\a1.py on
line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html f
or details

Why are these 4 cases' outputs in question 2 different? Can anybody explain it in detail?

Solution

Answer to Question 1:

In Python versions <3.3, length for a Unicode string u'' is the number of UTF-16 or UTF-32 code units used (depending on build flags), not the number of bytes. \u4e25 is one code unit, but not all characters are represented by one code unit if UTF-16 (default on Windows) is used.

>>> len(u'\u4e25')
1
>>> len(u'\U00010123')
2

In Python 3.3, len returns 1 for both of the strings above.
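This is easy to check on a Python 3.3+ build, where strings are stored per code point (a quick sketch, run under Python 3):

```python
# Python 3.3+ (PEP 393): len() counts code points, not UTF-16 code units
assert len('\u4e25') == 1      # 严, inside the BMP
assert len('\U00010123') == 1  # outside the BMP, still one code point
```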

Also, Unicode characters can be composed using combining marks, such as é. The normalize function can be used to generate the composed or decomposed form:

>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.normalize('NFD',u'\xe9')
u'e\u0301'
>>> ud.normalize('NFC',u'e\u0301')
u'\xe9'

So even in Python 3.3, a single display character can have 1 or more code units, and it is best to normalize to one form or another for consistent answers.
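The normalization point can be demonstrated with a small Python 3 sketch (the same unicodedata calls work in Python 2 with u'' literals):

```python
import unicodedata

composed = '\xe9'       # 'é' as one precomposed code point
decomposed = 'e\u0301'  # 'e' followed by a combining acute accent

# the two forms display identically but have different lengths
assert len(composed) == 1
assert len(decomposed) == 2

# normalizing maps between the forms, giving consistent answers
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```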

Answer to Question 2:

The encoding declared at the top of the file must agree with the encoding in which the file is saved. The declaration lets Python know how to interpret the bytes in the file.
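PEP 263 specifies the form of that declaration: a comment in the first two lines matching the pattern coding[:=]\s*([-\w.]+). A minimal sketch of the lookup (simplified; Python's real tokenizer also handles BOMs and checks a second line):

```python
import re

# simplified version of the PEP 263 coding-cookie pattern
COOKIE = re.compile(rb'coding[:=]\s*([-\w.]+)')

def declared_encoding(first_line):
    """Return the encoding named in a coding cookie, or None."""
    m = COOKIE.search(first_line)
    return m.group(1).decode('ascii') if m else None

assert declared_encoding(b'# -*- encoding: utf-8 -*-') == 'utf-8'
assert declared_encoding(b'# -*- encoding: gbk -*-') == 'gbk'
assert declared_encoding(b'import sys') is None
```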

For example, the character 严 is saved as 3 bytes in a file encoded as UTF-8, but as two bytes in a file encoded as GBK:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> u'严'.encode('gbk')
'\xd1\xcf'

If you declare the wrong encoding, the bytes are interpreted incorrectly and Python either displays the wrong characters or throws an exception.
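A round trip makes this concrete (Python 3 syntax; the byte values are the same in Python 2):

```python
utf8_bytes = '严'.encode('utf-8')  # b'\xe4\xb8\xa5'
gbk_bytes = '严'.encode('gbk')     # b'\xd1\xcf'

# decoding with the matching codec recovers the character
assert utf8_bytes.decode('utf-8') == '严'
assert gbk_bytes.decode('gbk') == '严'

# decoding with the wrong codec silently yields mojibake here
assert gbk_bytes.decode('latin-1') == '\xd1\xcf'  # 'ÑÏ', not 严
```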

Edit per comment

2(1) - This is system dependent due to ANSI being the system locale default encoding. On my system that is cp1252 and Notepad++ can't display a Chinese character. If I set my system locale to Chinese(PRC) then I get your results on a console terminal. The reason it works correctly in that case is a byte string is used and the bytes are just sent to the terminal. Since the file was encoded in ANSI on a Chinese(PRC) locale, the bytes the byte string contains are correctly interpreted by the Chinese(PRC) locale terminal.

2(2) - The file is encoded in UTF-8 but the encoding is declared as GBK. When Python reads the encoding it tries to interpret the file as GBK and fails. You've chosen UTF-8 as the encoding, which on Notepad++ also includes a UTF-8 encoded byte order mark (BOM) as the first character in the file and the GBK codec doesn't read it as a valid GBK-encoded character, so fails on line 1.
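In Python 3 the same source-encoding check is exposed as tokenize.detect_encoding, which reproduces both behaviors (a sketch; Python 2's tokenizer behaves analogously):

```python
import io
import tokenize

# UTF-8 BOM plus a conflicting gbk declaration: rejected, as in case 2(2)
src = b'\xef\xbb\xbf# -*- coding: gbk -*-\nprint("hi")\n'
try:
    tokenize.detect_encoding(io.BytesIO(src).readline)
    conflict_rejected = False
except SyntaxError:
    conflict_rejected = True
assert conflict_rejected

# UTF-8 BOM with no declaration: the BOM alone selects UTF-8, as in case 2(3)
src = b'\xef\xbb\xbfprint("hi")\n'
encoding, _ = tokenize.detect_encoding(io.BytesIO(src).readline)
assert encoding == 'utf-8-sig'
```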

2(3) - The file is encoded in UTF-8 (with BOM) but has no encoding declaration. Python recognizes the UTF-8-encoded BOM and uses UTF-8 as the source encoding, but the terminal is GBK. Since a byte string was used, the UTF-8-encoded bytes are sent as-is to the GBK terminal and you get:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> '\xe4\xb8'.decode('gbk')
u'\u6d93'
>>> print '\xe4\xb8'.decode('gbk')
涓

In this case I am surprised, because the terminal seems to ignore the byte \xa5; as you can see below, when I explicitly decode the same bytes, Python throws an exception:

>>> u'严'.encode('utf8').decode('gbk')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 2: incomplete multibyte sequence
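The failure can be reproduced directly, and the decode errors argument shows what happens to the dangling byte (Python 3 sketch):

```python
data = '严'.encode('utf-8')  # b'\xe4\xb8\xa5'

# strict decoding fails: 0xe4 0xb8 is the GBK character 涓, but the
# trailing 0xa5 is an incomplete multibyte sequence
try:
    data.decode('gbk')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# errors='replace' keeps going, substituting U+FFFD for the bad byte
repaired = data.decode('gbk', errors='replace')
assert repaired.startswith('\u6d93')  # 涓
assert '\ufffd' in repaired
```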

2(4) - In this case, the encoding is ANSI (GBK) but no encoding is declared, and there is no BOM like in UTF-8 to give Python a hint, so it assumes ASCII and can't handle the GBK-encoded character on line 3.
