Python 3 chokes on CP-1252/ANSI reading

Question

I'm working on a series of parsers, and my unit tests produce a bunch of tracebacks like this:

  File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>

The files are opened with open() with no extra arguments. Can I pass extra arguments to open(), or use something in the codecs module, to open these files differently?

This came up with code that was written in Python 2 and converted to Python 3 with the 2to3 tool.

UPDATE: It turns out this is the result of feeding a zip file into the parser. The unit test actually expects this to happen: the parser should recognize the input as something it can't parse. So I need to change my exception handling, which I'm doing now.
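
One way the exception handling described in the update could look is sketched below. The names `UnparseableInput` and `read_text` are hypothetical and not taken from the original code, and the sketch assumes the parser reads its input as cp1252 text.

```python
class UnparseableInput(Exception):
    """Hypothetical exception for input the parser cannot handle."""


def read_text(path, encoding="cp1252"):
    """Return the decoded contents of path, or raise UnparseableInput."""
    try:
        with open(path, encoding=encoding) as handle:
            return handle.read()
    except UnicodeDecodeError as exc:
        # Binary data such as a zip archive usually contains bytes that
        # the text codec cannot map, so report the input as unparseable.
        raise UnparseableInput(
            "{}: not decodable as {}".format(path, encoding)
        ) from exc
```

A unit test can then assert that feeding a zip file to the parser raises `UnparseableInput` rather than letting the raw `UnicodeDecodeError` escape.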

Recommended answer

Byte 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it maps to the control character U+0081 HIGH OCTET PRESET (HOP). I can reproduce your error in Python 3.1 like this:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

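Or with an actual file (on this machine the default encoding is UTF-8 rather than cp1252, hence the different codec name in the error):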

>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

To treat this file as Latin-1, pass the encoding argument, as codeape suggested:

>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
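
As a self-contained sketch of the same idea, the snippet below writes the test file to a temporary directory and reads it back in two ways. The `errors="replace"` variant is an addition not discussed in the answer: it is a standard open() parameter that keeps cp1252 as the encoding while substituting U+FFFD for bytes the codec cannot map.

```python
import os
import tempfile

# Create the same two-byte file used in the session above.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as handle:
    handle.write(b"\x81\n")

# Latin-1 maps every byte to a code point, so decoding never fails.
with open(path, encoding="latin-1") as handle:
    as_latin1 = handle.read()

# cp1252 with errors="replace" swaps the unassigned byte for U+FFFD.
with open(path, encoding="cp1252", errors="replace") as handle:
    as_cp1252 = handle.read()

print(repr(as_latin1))
print(repr(as_cp1252))
```

Which option fits depends on the data: Latin-1 silently accepts anything, including binary files, while the replacement character at least leaves a visible marker where decoding went wrong.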

Beware that the Windows-1252 and Latin-1 encodings differ: for example, Latin-1 has no "smart quotes". If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
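
The smart-quote difference can be seen directly by decoding the same bytes with both codecs. This is only an illustration; the sample bytes are made up for the example.

```python
# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252.
smart_quotes = b"\x93quoted\x94"

# cp1252 yields U+201C and U+201D, the left and right double quotation marks.
as_cp1252 = smart_quotes.decode("cp1252")

# Latin-1 yields the C1 control characters U+0093 and U+0094 instead.
as_latin1 = smart_quotes.decode("latin-1")

print(ascii(as_cp1252))
print(ascii(as_latin1))
```

So reading a genuine Windows-1252 text file as Latin-1 will not raise an error, but any smart quotes in it will come out as invisible control characters.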
