Python - dealing with mixed-encoding files

Problem description

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.

I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting the replacement this way raises a UnicodeDecodeError, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Any ideas for how to deal with this?

Solution

If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.

However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function. The callback receives the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the offending data as "cp1252" and then continues decoding the rest of the string as UTF-8.
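As a rough illustration of the callback idea, here is a minimal sketch that assumes Python 3 (the rest of this answer uses Python 2 syntax). The handler name `cp1252_fallback` is made up for this example, and undefined cp1252 bytes are mapped to U+FFFD, which is a choice of this sketch rather than part of the original answer.

```python
import codecs


def cp1252_fallback(error):
    """Decode the first offending byte as cp1252, then resume after it."""
    if not isinstance(error, UnicodeDecodeError):
        # This handler only makes sense for decoding errors.
        raise error
    bad_byte = error.object[error.start:error.start + 1]
    # A few byte values (e.g. 0x81) are undefined in cp1252; "replace"
    # turns those into U+FFFD instead of raising inside the handler.
    replacement = bad_byte.decode("cp1252", errors="replace")
    # Return the replacement text and the position to resume decoding at.
    return replacement, error.start + 1


# The name "cp1252_fallback" is hypothetical; any unused name works.
codecs.register_error("cp1252_fallback", cp1252_fallback)

data = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
text = data.decode("utf-8", errors="cp1252_fallback")
print(text)
```

Note that a cp1252 byte sequence which happens to form valid UTF-8 will still be read as UTF-8, so this kind of repair is a heuristic.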

In my utf-8 terminal, I can build a mixed incorrect string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

I wrote this callback function and found a catch: even if you increment the position from which to decode the string by 1, so that decoding would start at the next character, if that next character is also not UTF-8 and out of range(128), the error is raised again at the first out of range(128) character. In other words, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 characters are found.

The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from the position of the last call. In this short example I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):

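import codecs

# Position of the last byte handled; reset to -1 before each decode call.
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    # The byte string being decoded (Python 2 str).
    string = unicode_error.object
    position = unicode_error.start
    # If the decoder "walked back", continue after the last handled byte.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # Interpret the offending byte as cp1252.
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    # Return the replacement and the position at which to resume decoding.
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)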

And on the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã 
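The global state variable can be avoided by driving the decoding from an explicit loop instead of a registered error handler. The following is only a sketch of that alternative, assuming Python 3; `decode_mixed` and its parameters are hypothetical names introduced here, not part of the original answer.

```python
def decode_mixed(data, primary="utf-8", fallback="cp1252"):
    """Decode bytes as `primary`, using `fallback` for each invalid byte."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(data[pos:].decode(primary))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice passed to decode().
            bad = pos + err.start
            # Everything before the offending byte is valid in `primary`.
            pieces.append(data[pos:bad].decode(primary))
            # Decode just the offending byte with the fallback codec;
            # bytes undefined in the fallback become U+FFFD.
            pieces.append(data[bad:bad + 1].decode(fallback, errors="replace"))
            pos = bad + 1
    return "".join(pieces)


mixed = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
print(decode_mixed(mixed))
```

Because all state lives in local variables, nothing needs to be reset between calls. The repeated slicing makes this slow for inputs with many invalid bytes, so it is best suited to files that are mostly clean.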
