UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?


Problem Description

I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non ascii characters, such as Ö, for example. I need to read the lines using python, and I can afford to ignore a line which has a non-ascii character.

My problem is that when I read the file in Python, I get the UnicodeDecodeError when reaching the line where a non-ascii character exists, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is reached the for loop ends and I cannot read the rest of the file. I want to skip the line that causes the error and continue. I would rather not make any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

Recommended Answer

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to handle decoding errors with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
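As an illustrative sketch, here is how the main read-side handlers behave on a byte string containing 0xE9, a byte that is valid Latin-1 ('café') but invalid as UTF-8:

```python
# A byte string that is valid Latin-1 ('café') but invalid UTF-8:
data = b"caf\xe9"

ignored = data.decode("utf-8", errors="ignore")           # bad byte dropped
replaced = data.decode("utf-8", errors="replace")         # bad byte -> U+FFFD
escaped = data.decode("utf-8", errors="surrogateescape")  # bad byte -> U+DCE9

print(ignored)          # caf
print(repr(replaced))   # 'caf\ufffd'
print(repr(escaped))    # 'caf\udce9'

# surrogateescape round-trips: re-encoding restores the original bytes
assert escaped.encode("utf-8", errors="surrogateescape") == data
```

Note how only surrogateescape preserves enough information to recover the original bytes later.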

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
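For instance, a minimal sketch (using a temporary file to stand in for test.csv) that reads every line with errors="replace", so undecodable bytes become U+FFFD instead of raising an exception:

```python
import os
import tempfile

# Create a sample file with one invalid UTF-8 byte (0xD6, 'Ö' in Latin-1).
with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=".csv") as tmp:
    tmp.write(b"name;city\nXYZ;G\xd6teborg\n")

# errors="replace" substitutes U+FFFD for the bad byte; iteration never raises.
with open(tmp.name, encoding="utf-8", errors="replace") as f:
    lines = f.read().splitlines()

os.unlink(tmp.name)
print(lines)  # ['name;city', 'XYZ;G\ufffdteborg']
```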

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

For example:

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")
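Since the original goal was to skip bad lines entirely, the same surrogate test can be used as a filter. This is a sketch with a hypothetical read_clean_lines() helper, demonstrated on a temporary file:

```python
import os
import re
import tempfile

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def read_clean_lines(path, encoding="utf-8"):
    """Hypothetical helper: yield only lines that decoded without errors."""
    with open(path, encoding=encoding, errors="surrogateescape") as f:
        for line in f:
            # Lines containing surrogate escapes had undecodable bytes; skip them.
            if not _surrogates.search(line):
                yield line

# Demo with a temporary file whose middle line has an invalid UTF-8 byte.
with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=".csv") as tmp:
    tmp.write(b"good line 1\nbad line \xd6\ngood line 2\n")

clean = list(read_clean_lines(tmp.name))
os.unlink(tmp.name)
print(clean)  # ['good line 1\n', 'good line 2\n']
```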

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line. If the file is big enough, that can then in turn lead to a MemoryError exception if the 'line' is large enough.
