Python string decoding issue


Problem description

I am trying to parse a CSV file containing some data, mostly numeric but with some strings. I do not know their encoding, but I do know they are in Hebrew.

Eventually I need to know the encoding so I can convert the strings to unicode, print them, and perhaps throw them into a database later on.

I tried using Chardet, which claims the strings are Windows-1255 (cp1255) but trying to do print someString.decode('cp1255') yields the notorious error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

I tried every other encoding possible, to no avail. Also, the file is absolutely valid since I can open the CSV in Excel and I see the correct data.

Any idea how I can properly decode these strings?


EDIT: here is an example. One of the strings looks like this (first five letters of the Hebrew alphabet):

print repr(sampleString)
#prints:
'\xe0\xe1\xe2\xe3\xe4'

(using Python 2.6.2)

Solution

This is what's happening:

  • sampleString is a byte string (cp1255 encoded)
  • sampleString.decode("cp1255") decodes (decode==bytes -> unicode string) the byte string to a unicode string
  • print sampleString.decode("cp1255") attempts to print the unicode string to stdout. Print has to encode the unicode string to do that (encode==unicode string -> bytes). The error that you're seeing means that the python print statement cannot write the given unicode string to the console's encoding. sys.stdout.encoding is the terminal's encoding.
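The two directions described in the list above can be sketched as follows. This is only an illustration, written in Python 3 syntax as an assumption (the question uses Python 2.6, where the byte string has no `b` prefix); the variable names are made up for the example.

```python
# Sketch in Python 3 syntax (assumption; the question uses Python 2.6).
raw = b'\xe0\xe1\xe2\xe3\xe4'   # cp1255-encoded bytes from the question

# decode: bytes -> unicode string
text = raw.decode("cp1255")

# encode: unicode string -> bytes; printing has to do this step internally
utf8_bytes = text.encode("utf-8")

# Encoding with a codec that has no Hebrew letters fails,
# which is the kind of error reported in the question.
try:
    text.encode("ascii")
    ascii_failure = None
except UnicodeEncodeError as exc:
    ascii_failure = exc

print(repr(text))
print(repr(utf8_bytes))
print(ascii_failure)
```

The decode step succeeds; the failure only appears once the unicode string has to be turned back into bytes for an encoding that cannot represent it.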

So the problem is that your console does not support these characters. You should be able to tweak the console to use another encoding. The details on how to do that depend on your OS and terminal program.

Another approach would be to manually specify the encoding to use:

print sampleString.decode("cp1255").encode("utf-8")
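To see why the output encoding matters without touching a real console, one option is to write to an in-memory stream that encodes text the way a terminal stream would. This is a rough sketch in Python 3 syntax (an assumption, since the question uses Python 2.6); `write_to_stream` is a hypothetical helper, not part of any library.

```python
import io

def write_to_stream(text, encoding):
    """Write text to an in-memory stream that encodes with the given codec.

    Hypothetical helper for illustration: it stands in for a console
    whose sys.stdout.encoding equals `encoding`.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()

text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")

# A stream that encodes as UTF-8 accepts the Hebrew letters.
utf8_output = write_to_stream(text, "utf-8")

# A stream that encodes as ASCII cannot represent them.
try:
    write_to_stream(text, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(utf8_output))
print(ascii_ok)
```

The same unicode string is written both times; only the encoding of the target stream decides whether the write succeeds.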

A simple test program you can experiment with:

import sys
print sys.stdout.encoding
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print samplestring.decode("cp1255").encode(sys.argv[1])

On my utf-8 terminal:

$ python2.6 test.py utf-8
UTF-8
אבגדה

$ python2.6 test.py latin1
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

$ python2.6 test.py ascii
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

$ python2.6 test.py cp424
UTF-8
ABCDE

$ python2.6 test.py iso8859_8
UTF-8
�����

The error messages for latin-1 and ascii mean that the unicode characters in the string cannot be represented in these encodings.

Notice the last two. I encode the unicode string to the cp424 and iso8859_8 encodings (two of the encodings listed on http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I get no exception using these encodings, since the Hebrew unicode characters have a representation in the encodings.

But my utf-8 terminal gets very confused when it receives bytes in a different encoding than utf-8.

In the first case (cp424), my UTF-8 terminal displays ABCDE, meaning that the utf-8 representation of A corresponds to the cp424 representation of א, i.e. the byte value 65 means A in utf-8 and א in cp424.
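The byte-level picture behind the last two test runs can be checked directly. The sketch below uses Python 3 syntax (an assumption; the question uses Python 2.6) and imitates the terminal by decoding the bytes as UTF-8 with the `replace` error handler, which is only an approximation of what a real terminal does.

```python
text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")

# Both codecs can represent the Hebrew letters, so neither call raises.
cp424_bytes = text.encode("cp424")
iso_bytes = text.encode("iso8859_8")

# Approximation of a UTF-8 terminal interpreting those bytes.
seen_cp424 = cp424_bytes.decode("utf-8", "replace")
seen_iso = iso_bytes.decode("utf-8", "replace")

print(repr(cp424_bytes), repr(seen_cp424))
print(repr(iso_bytes), repr(seen_iso))
```

cp424 stores the five letters as the bytes 0x41 to 0x45, which a UTF-8 terminal reads as A to E. iso8859_8 produces bytes that are not valid UTF-8 at all, so the terminal falls back to replacement characters.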

The encode method has an optional string argument you can use to specify what should happen when the encoding cannot represent a character (documentation). The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies.

Another test program (I print with quotes around the string to better show how ignore behaves):

import sys
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1], 
      sys.argv[2]))

The results:

$ python2.6 test.py latin1 strict
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    sys.argv[2]))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
[/tmp]
$ python2.6 test.py latin1 ignore
''
[/tmp]
$ python2.6 test.py latin1 replace
'?????'
[/tmp]
$ python2.6 test.py latin1 xmlcharrefreplace
'&#1488;&#1489;&#1490;&#1491;&#1492;'
[/tmp]
$ python2.6 test.py latin1 backslashreplace
'\u05d0\u05d1\u05d2\u05d3\u05d4'
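A custom strategy can be registered with `codecs.register_error`. The sketch below is one possible shape, in Python 3 syntax (an assumption; the question uses Python 2.6). The handler name `bracket_placeholder` and the `[?]` marker are invented for the example.

```python
import codecs

def bracket_placeholder(error):
    """Replace every unencodable character with the marker '[?]'.

    Hypothetical handler for illustration. It must return a tuple of
    (replacement, position at which encoding should resume).
    """
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad_run = error.object[error.start:error.end]
    return ("[?]" * len(bad_run), error.end)

# Make the handler available under a name usable as the errors argument.
codecs.register_error("bracket_placeholder", bracket_placeholder)

text = b'\xe0\xe1\xe2\xe3\xe4'.decode("cp1255")
result = text.encode("latin-1", "bracket_placeholder")
print(result)
```

The registered name is passed to encode in the same position as the built-in strategies, so it could be given on the command line of the test program above just like ignore or replace.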
