Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?


Question


I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = f.readline().strip().split('\t')
        for line in f.readlines():
            yield process_tag_record(fields, line)

I receive the following error:

Traceback (most recent call last):
  File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
    main()
  File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
    all_tags = list(tags("tag.txt"))
  File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
    content = f.read()
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
    return self.reader.read(size)
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?

What have I tried

I did a hexdump of the file and found that the offending text was the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context as it is the soft hyphen. But the following does not seem to work:

Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.

Hexdump:

0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|
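On the debugging question itself, one way to survey the damage before deciding how to handle it is to scan the raw bytes and report every offset where strict UTF-8 decoding fails. A minimal sketch (the helper name is made up; re-decoding the tail after each failure is quadratic in the worst case, but fine for a one-off diagnostic pass):

```python
# Hypothetical helper: yield the offset and some surrounding context for
# every byte run that breaks strict UTF-8 decoding of the file.
def find_decode_errors(filename, encoding='utf-8'):
    with open(filename, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest of the file decodes cleanly
        except UnicodeDecodeError as err:
            start = pos + err.start
            yield start, data[max(start - 20, 0):start + 20]
            pos = start + 1  # skip past the bad byte and keep scanning
```

Running this over the downloaded tag.txt should point straight at offsets like 3583587 with enough context to recognize the surrounding text.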

Solution

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

>>> '\u00ad'.encode('utf8')
b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.
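To make the "most sense" argument concrete: 0xAD is a continuation byte (binary 10xxxxxx), so it is only valid after a suitable lead byte, and different lead bytes give entirely different characters:

```python
print(b'\xc2\xad'.decode('utf-8'))  # U+00AD SOFT HYPHEN (invisible in print)
print(b'\xd0\xad'.decode('utf-8'))  # U+042D, Cyrillic capital letter E
# On its own, 0xAD can never start a valid UTF-8 sequence:
try:
    b'\xad'.decode('utf-8')
except UnicodeDecodeError as err:
    print(err.reason)  # invalid start byte
```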

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
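For reference, here is what errors='replace' would hand to downstream code such as a database insert: each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER, so nothing is silently dropped, but the original byte value is lost (quick experiment using the bytes from the hexdump):

```python
# The lone 0xAD byte from the hexdump becomes U+FFFD under errors='replace'.
data = b'SUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING'
text = data.decode('utf-8', errors='replace')
print(text)  # NON\ufffdCASH -- the replacement character sits where 0xAD was
```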

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked), and open tag.txt, I can't decode the data as UTF-8:

>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')

There are two such non-ASCII characters in the file:

>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
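A side note on why Latin-1 is a safe decoding to experiment with here: it maps every one of the 256 byte values to the code point of the same number, so decoding can never raise and round-trips byte-for-byte:

```python
raw = b'Hotel Kranichh\xf6he'
text = raw.decode('latin-1')          # never raises, whatever the bytes are
print(text)                           # Hotel Kranichhöhe
assert text.encode('latin-1') == raw  # exactly reversible
```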

There are also several 0x1C / 0x1D pairs in the file:

>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!

There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
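The stripped-high-bytes theory is easy to reproduce: keeping only the low byte of each UTF-16 code unit maps exactly these punctuation characters onto 0x13/0x14/0x18/0x19/0x1C/0x1D. A sketch of the suspected bug (not the SEC's actual pipeline, of course):

```python
originals = '\u201cquoted\u201d \u2013 and \u2019'
# Keep only the low byte of each code point, as a broken UTF-16 pipeline might.
mangled = bytes(ord(ch) & 0xFF for ch in originals)
print(mangled)  # b'\x1cquoted\x1d \x13 and \x19'
```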

If we assume that the encoding is broken, we can attempt to repair. The following code would read the file and fix the quotes issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:

_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

then apply that to lines you read:

with open(filename, 'r', encoding='latin-1') as f:
    repaired = map(repair, f)
    fields = next(repaired).strip().split('\t')
    for line in repaired:
        yield process_tag_record(fields, line)

Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)
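For illustration, here is the same pattern against an in-memory sample (the column names below are made up, not the real tag.txt header; only the layout matches: tab-delimited, header row first):

```python
import csv
import io

# Illustrative sample only -- hypothetical column names.
sample = io.StringIO('tag\tversion\tcustom\nAssets\tus-gaap/2016\t0\n')
reader = csv.reader(sample, delimiter='\t')
fields = next(reader)   # the header row
for row in reader:
    print(row)          # each data row arrives already split into a list
```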

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually.
        yield from reader
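Against the same made-up sample, csv.DictReader() keys each row by the header row on its own:

```python
import csv
import io

# Hypothetical two-column sample in the same tab-delimited layout.
sample = io.StringIO('tag\tversion\nAssets\tus-gaap/2016\n')
for record in csv.DictReader(sample, delimiter='\t'):
    print(record['tag'], record['version'])  # Assets us-gaap/2016
```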
