Python - dealing with mixed-encoding files

Question

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.

I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # "
    "\x94": u'\u201d', # "
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

How do I deal with this?

Recommended answer

If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.

However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function. The callback receives the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the offending data as "cp1252" and then continues decoding the rest of the string as UTF-8.
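To make the callback mechanics concrete, here is a minimal sketch written for Python 3 (an assumption; the code in this answer targets Python 2). The handler name report_and_replace and the seen list are hypothetical names chosen for illustration. The handler only records where decoding failed and substitutes U+FFFD, to show what the callback receives and what it is expected to return:

```python
import codecs

seen = []  # (start, end) of every byte range the decoder rejected


def report_and_replace(error):
    """Record where decoding failed, then substitute U+FFFD and move on."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only handles decoding errors
    # error.object is the full byte string; start/end delimit the bad bytes
    seen.append((error.start, error.end))
    # Return value: (replacement text, index at which decoding resumes)
    return "\ufffd", error.end


# "report_and_replace" is a hypothetical handler name for this sketch
codecs.register_error("report_and_replace", report_and_replace)

# 0xFF can never appear in valid UTF-8, so the handler is called once
result = b"ab\xffcd".decode("utf-8", "report_and_replace")
```

The registered name is then passed as the errors argument of decode, exactly like the built-in "ignore" or "replace" handlers.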

In my utf-8 terminal, I can build a mixed incorrect string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

I wrote the said callback function, and found a catch: even if you increment the position from which to decode the string by 1, so that decoding would start on the next character, if the next character is also not UTF-8 and is out of range(128), the error is raised at the first out-of-range(128) character. That is, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 characters are found.

The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from the position of the last call to it. In this short example I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):

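import codecs

# Python 2 code. Index of the last byte handled; reset to -1 before each decode.
last_position = -1

def mixed_decoder(unicode_error):
    """Error handler: decode one invalid byte as cp1252, then resume after it."""
    global last_position
    string = unicode_error.object    # the byte string being decoded
    position = unicode_error.start   # where the decoder reports the error
    # The decoder may "walk back" and report a byte that was already handled;
    # in that case continue from the byte after the last one handled.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"  # alternative: substitute a placeholder instead
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)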

On the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã 
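The same idea can be sketched for Python 3 and applied directly to a file, which is what the question asks about. This is a sketch under assumptions, not a tested port of the code above: it relies on the start and end attributes of the UnicodeDecodeError to locate the rejected bytes instead of tracking state in a global, and the names cp1252_fallback and read_mixed are hypothetical. Note that some cp1252 byte pairs happen to be valid UTF-8 and will be decoded as UTF-8 without ever reaching the handler.

```python
import codecs


def cp1252_fallback(error):
    """Decode bytes that are not valid UTF-8 as cp1252 instead."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # only decoding is handled in this sketch
    bad_bytes = error.object[error.start:error.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined
    return bad_bytes.decode("cp1252", errors="replace"), error.end


# "cp1252_fallback" is a hypothetical handler name chosen for this sketch
codecs.register_error("cp1252_fallback", cp1252_fallback)


def read_mixed(path):
    """Read a mostly-UTF-8 text file, repairing stray cp1252 bytes."""
    with open(path, encoding="utf-8", errors="cp1252_fallback") as handle:
        return handle.read()


# Same mixed string as in the console session ("\u00e7\u00e3" is "çã")
mixed = "ma\u00e7\u00e3 ".encode("utf-8") + "ma\u00e7\u00e3 ".encode("cp1252")
repaired = mixed.decode("utf-8", "cp1252_fallback")
```

Because open accepts any registered error handler name in its errors argument, the file can be read in text mode and iterated line by line as usual.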
