Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?

Question

I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:

import codecs

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = f.readline().strip().split('\t')
        for line in f.readlines():
            yield process_tag_record(fields, line)

I get the following error:

Traceback (most recent call last):
  File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
    main()
  File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
    all_tags = list(tags("tag.txt"))
  File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
    content = f.read()
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
    return self.reader.read(size)
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?

What I've tried

I did a hexdump of the file and found that the offending text was the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context as it is the soft hyphen. But the following does not seem to work:

Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.

Edited to add hexdump:

0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|

Recommended answer

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
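
As an aside on the failed attempt in the question: `b"\xc2ad"` is missing a backslash, so it is not the byte pair shown above. A minimal check, using only the bytes already quoted in this thread:

```python
# b"\xc2ad" is three bytes: 0xC2 followed by the ASCII letters "a" and "d".
broken = b"\xc2ad"
print(list(broken))  # [194, 97, 100]

# b"\xc2\xad" is the two-byte UTF-8 encoding of U+00AD SOFT HYPHEN.
correct = b"\xc2\xad"
decoded = correct.decode("utf-8")
print(len(correct), hex(ord(decoded)))  # 2 0xad
```

That is why the second attempt reported an invalid continuation byte: the decoder saw the lead byte 0xC2 followed by a plain "a".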

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
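
To see what that work-around actually produces, here is a small sketch on the offending text from the hexdump (the variable names are just for illustration):

```python
raw = b"SUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING"

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER for each
# undecodable byte; the original byte value is not recoverable afterwards.
replaced = raw.decode("utf-8", errors="replace")

print(replaced.index("\ufffd"))  # 30, the position of the bad byte
print(ascii(replaced[27:34]))    # 'NON\ufffdCAS'
```

The result is an ordinary str, so whatever you do with it next (including a database insert) receives a U+FFFD character where the soft hyphen was presumably intended.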

Another possibility is that the SEC is really using a different encoding for the file; for example in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked), and open tag.txt, I can't decode the data as UTF-8:

>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')
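
The seek-and-read steps above can be wrapped in a small helper. This is only a sketch: `show_decode_error` is a name made up here, and it relies on the `start`, `end` and `object` attributes that `UnicodeDecodeError` carries.

```python
def show_decode_error(data, encoding="utf-8", context=30):
    """Return (offset, surrounding bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        # exc.object is the bytes being decoded; slice around the failure.
        return exc.start, exc.object[lo:exc.end + context]
    return None

sample = b"1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING"
print(show_decode_error(sample))
```

For a real file you would pass in the result of `open(path, 'rb').read()`, which loads the whole file into memory.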

There are two such non-ASCII characters in the file:

>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
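
A quick way to confirm both readings, using the two byte strings found above (the variable names are illustrative):

```python
hotel = b"Hotel Kranichh\xf6he".decode("latin-1")
hyphen = b"NON\xadCASH".decode("latin-1")

print(hotel)          # Hotel Kranichhöhe
print(ascii(hyphen))  # 'NON\xadCASH', i.e. U+00AD SOFT HYPHEN

# Windows codepage 1252 agrees with Latin-1 for these two bytes.
print(b"\xad\xf6".decode("cp1252") == b"\xad\xf6".decode("latin-1"))  # True
```

Latin-1 maps every one of the 256 byte values to a code point, so decoding with it never raises; whether the result is the intended text is a separate question.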

There are also several 0x1C / 0x1D pairs in the file:

>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!
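
That guess can be modelled in a few lines. This is purely a hypothetical reconstruction of the suspected bug (`strip_high_bytes` is an invented name), not something known about the SEC's actual pipeline:

```python
def strip_high_bytes(text):
    """Keep only the low byte of each UTF-16 code unit (the suspected bug)."""
    utf16 = text.encode("utf-16-le")
    return utf16[0::2]  # little-endian, so the low byte comes first

print(strip_high_bytes("(\u201cTCCA\u201d)"))  # b'(\x1cTCCA\x1d)'
print(strip_high_bytes("NON\u00adCASH"))        # b'NON\xadCASH'
```

Both outputs match the bytes seen in the file, which is consistent with the theory but does not prove it.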

There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
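
If you want to check which of these control bytes occur in your own copy of the file, something along these lines would do. It is a sketch: `SUSPECT` and `count_suspect_bytes` are names introduced here, and the sample is one of the fragments quoted above.

```python
from collections import Counter

# Control bytes suspected to be mangled dashes and quotes.
SUSPECT = {0x13, 0x14, 0x18, 0x19, 0x1C, 0x1D}

def count_suspect_bytes(data):
    """Count how often each suspect control byte occurs in a bytes object."""
    return Counter(b for b in data if b in SUSPECT)

sample = b"plan (the \x1cAetna Pension Plan\x1d) to allow certai"
print(count_suspect_bytes(sample))
```

For the real data, pass in the bytes read from the file opened in 'rb' mode.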

If we assume that the encoding is broken, we can attempt to repair. The following code would read the file and fix the quotes issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:

_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

Then apply that to the lines you read:

def tags(filename):
    """Yield Tag instances from tag.txt, repairing mis-encoded punctuation."""
    with open(filename, 'r', encoding='latin-1') as f:
        repaired = map(repair, f)
        fields = next(repaired).strip().split('\t')
        for line in repaired:
            yield process_tag_record(fields, line)
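
As a quick sanity check of the repair step on one of the fragments found earlier (the mapping and function are repeated here only so the snippet runs on its own):

```python
_map = {
    0x13: '\u2013', 0x14: '\u2014',  # dashes
    0x18: '\u2018', 0x19: '\u2019',  # single quotes
    0x1c: '\u201c', 0x1d: '\u201d',  # double quotes
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1."""
    return line.translate(_map)

raw = b'Continuation Act of 2011 (\x1cTCCA\x1d) recognized'
fixed = repair(raw.decode('latin-1'))
print(fixed)  # Continuation Act of 2011 (“TCCA”) recognized
```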

Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually.
        yield from reader
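
Since this particular file does not decode as UTF-8, the DictReader approach can also be combined with the Latin-1 repair from earlier. The following is a sketch under assumptions: `repaired_rows` is an invented helper, and the column names `tag` and `doc` in the sample are placeholders rather than a statement about the real file layout.

```python
import csv
import io

_map = {
    0x13: '\u2013', 0x14: '\u2014',
    0x18: '\u2018', 0x19: '\u2019',
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repaired_rows(f):
    """Yield one dict per row from a tab-separated text stream, fixing punctuation."""
    # csv.DictReader accepts any iterable of strings, so repair lazily per line.
    lines = (line.translate(_map) for line in f)
    yield from csv.DictReader(lines, delimiter='\t')

# Stand-in for a file opened with encoding='latin-1'.
sample = io.StringIO("tag\tdoc\nFoo\t(\x1cTCCA\x1d)\n")
rows = list(repaired_rows(sample))
print(rows[0]['doc'])  # (“TCCA”)
```

With the real data you would pass in `open(filename, encoding='latin-1', newline='')` instead of the StringIO object.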
