Decoding issue while parsing JSON [python]

Problem description

I am reading a JSON file in Python that has many fields and values (~8000 records). Environment: Windows 10, Python 3.6.4. Code:

import json
json_data = json.load(open('json_list.json'))
print (json_data)

With this I get an error. Below is the stack trace:

  json_data = json.load(open('json_list.json'))
  File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>

I also tried:

import json
with open('json_list.json', encoding='utf-8') as fd:
     json_data = json.load(fd)
     print (json_data)

With this, my program runs for a long time and then hangs with no output.

I have searched almost all topics related to this and could not find a solution.

Note: The JSON data is valid; when I view it in Postman or any other REST client, no anomalies are reported.

Any help with this, or an alternative way to load my JSON data (for example, by converting it to a string and then back to JSON), would be greatly appreciated.

Here is what the file looks like around the reported error:

>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
 b'":"2017-04-10","storage_size_gb":"84.747')

Recommended answer

The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.

Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.

The string INGL\xc3\x83\xc2\x89S suggests this analysis. If you guess that it is supposed to say Inglés in upper case (INGLÉS), note that the UTF-8 encoding of É is \xC3 \x89, and then look at which characters those two bytes represent in Latin-1 (or, as it happens, in Unicode, whose first 256 code points coincide with Latin-1, although the two are not compatible at the encoding level), you get Ã (U+00C3) and a control character (U+0089). Their UTF-8 encodings are \xC3\x83 and \xC2\x89, which is exactly the sequence in the file.
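The byte sequence can be reproduced in Python. This is only a minimal sketch, assuming the faulty step decoded UTF-8 bytes as Latin-1 and then encoded the result as UTF-8 again; the variable names are illustrative and not from the original question.

```python
# The intended character and its correct UTF-8 encoding.
original = "É"
utf8_once = original.encode("utf-8")

# Misreading those two bytes as Latin-1 yields two characters (mojibake):
# U+00C3 and the control character U+0089.
misread = utf8_once.decode("latin-1")

# Encoding that mojibake as UTF-8 again gives the bytes seen in the file.
utf8_twice = misread.encode("utf-8")

# For comparison (an assumption being tested, not a claim about the source):
# code page 1252 maps byte 0x89 to a different character, so it would
# produce a different byte sequence.
via_cp1252 = utf8_once.decode("cp1252").encode("utf-8")

print(utf8_once)    # b'\xc3\x89'
print(utf8_twice)   # b'\xc3\x83\xc2\x89'
print(via_cp1252 == utf8_twice)  # False
```

Since the Latin-1 round trip matches the file and the code page 1252 round trip does not, this particular snippet looks like plain Latin-1 was involved, though other records in the file could differ.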

Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.

Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
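One possible way to undo the extra round of encoding, sketched here under the assumption that the damage is a single Latin-1-to-UTF-8 misconversion, is to reverse it string by string after parsing and to leave alone any string for which the reversal fails. The helper names `undo_double_encoding` and `repair` are hypothetical, and the sample bytes are adapted from the snippet in the question.

```python
import json


def undo_double_encoding(text):
    """Reverse one Latin-1 -> UTF-8 misconversion, if it applies.

    Strings that cannot be round-tripped are returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def repair(value):
    """Walk parsed JSON data and repair every string, including dict keys."""
    if isinstance(value, str):
        return undo_double_encoding(value)
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value


# Sample record adapted from the bytes shown in the question.
sample_bytes = b'{"name": "INGL\xc3\x83\xc2\x89S, S.A.", "memory_size_gb": "64"}'
data = json.loads(sample_bytes.decode("utf-8"))
fixed = repair(data)
print(fixed["name"])
```

For the real file, the same `repair` call could be applied to the result of `json.load` on a file opened with `encoding='utf-8'`. This is a heuristic: a correctly encoded string that happens to look like mojibake would be altered too, so it is worth spot-checking the repaired records.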
