UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?


Question

I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö. I need to read the lines using Python, and I can afford to ignore any line that has a non-ASCII character.

My problem is that when I read the file in Python, I get a UnicodeDecodeError upon reaching the line where a non-ASCII character exists, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is reached the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and go on. I would rather not make any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

Answer

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.
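For instance, if the file actually uses a single-byte encoding such as Latin-1 (which would explain a stray Ö), decoding with that codec succeeds where strict UTF-8 fails. A minimal sketch, using a throwaway file in place of the real test.csv:

```python
import os
import tempfile

# Create a sample file encoded as Latin-1 (a hypothetical stand-in for test.csv)
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "wb") as f:
    f.write("name;city\nÖsterreich;Wien\n".encode("latin-1"))

# Decoding as latin-1 succeeds; decoding the same bytes as strict UTF-8
# would raise UnicodeDecodeError on the lone 0xD6 byte ("Ö").
with open(path, encoding="latin-1") as f:
    text = f.read()

print(text)
os.remove(path)
```

If `file -bi` reports us-ascii but the file contains Ö, the tool simply guessed from a sample that happened to be pure ASCII; the real codec has to be determined from the file's provenance.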

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python's backslashed escape sequences.

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
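A minimal sketch of the difference between 'ignore' and 'replace', using a throwaway file whose second line contains one byte that is invalid as UTF-8 (the file name and contents are fabricated for illustration):

```python
import os
import tempfile

# Write a file whose second line contains 0xD6, which is invalid UTF-8 on its own
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "wb") as f:
    f.write(b"good line\nbad line: \xd6\nanother good line\n")

# errors="ignore" silently drops the undecodable byte (data loss)
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.read().splitlines()

# errors="replace" substitutes U+FFFD (the replacement character) instead
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read().splitlines()

print(ignored[1])   # the bad byte is simply gone
print(replaced[1])  # the bad byte became U+FFFD
os.remove(path)
```

Note that both variants read every line; neither tells you which lines were damaged, which is what the next step addresses.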

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

For example:

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")
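Since the goal here is to skip offending lines entirely rather than just report them, the same surrogate test can drive a filter. A hedged sketch of that variant, with the regular expression inlined so it stands alone and a fabricated sample file in place of test.csv:

```python
import os
import re
import tempfile

# Bytes 0x80-0xFF that fail to decode become U+DC80-U+DCFF under surrogateescape
_surrogates = re.compile(r"[\uDC80-\uDCFF]")

# Fabricated sample: the middle line contains an invalid UTF-8 byte
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "wb") as f:
    f.write(b"ok\nbad \xd6 line\nok too\n")

# surrogateescape turns the bad byte into U+DCD6, which the regex detects,
# so any line containing a decoding error is simply dropped
with open(path, encoding="utf8", errors="surrogateescape") as f:
    kept = [line.rstrip("\n") for line in f if not _surrogates.search(line)]

print(kept)
os.remove(path)
```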

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line. If the file is big enough, that can then in turn lead to a MemoryError exception if the 'line' is large enough.
