ValueError: Unpaired high surrogate when decoding 'string' on reading json file

Problem description

I am currently using Python 3.8.6. I get the following error when reading (thousands of) JSON files in Python:

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I tried the following solutions from other Stack Overflow posts, but nothing worked:

1) import json
   json.loads('{"":"\\ud800"}')

2) import simplejson
   simplejson.loads('{"":"\\ud800"}')

The problem is that once this error occurs, the remaining JSON files are not read. Is there a way to get rid of this error so that I can read all the JSON files?

I am not sure what information about the problem is needed, so please feel free to ask.

Recommended answer

The Unicode code point U+D800 is a high surrogate. It may only occur as the first half of a surrogate pair (and then only in UTF-16 encoding). So the string inside the JSON, once decoded, contains a lone surrogate and cannot be represented as valid UTF-8.

The JSON itself might or might not be valid. The spec doesn't mention the case of unmatched surrogates, but it does explicitly allow nonexistent code points:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
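To make the "specific processor" point concrete, here is a minimal sketch of how CPython's built-in json module treats the two cases. It only illustrates that one parser; other JSON libraries may behave differently.

```python
import json

# A well-formed surrogate pair escape is combined into a single code point
# (the G clef character from the quoted passage).
clef = json.loads('"\\ud834\\udd1e"')
print(len(clef), hex(ord(clef)))  # 1 0x1d11e

# A lone high surrogate is accepted by the parser and kept as-is.
lone = json.loads('"\\ud800"')
print(len(lone), hex(ord(lone)))  # 1 0xd800
```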

Now, you can choose your friends, but you can't choose your family, and you can't always choose your JSON either. So the next question is: how do you parse this mess?

It looks like neither the built-in json module in Python (version 3.9) nor simplejson (version 3.17.2) has any problem parsing this JSON. The problem only occurs once you try to use the string, for example by printing it. So this really doesn't have anything to do with JSON at all:

>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
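If you want to find out which strings are affected before they blow up somewhere else, one possible check is to attempt a strict UTF-8 encode and catch the failure. The helper name `has_surrogates` is made up for this sketch.

```python
def has_surrogates(text: str) -> bool:
    """Return True if text contains surrogate code points (U+D800..U+DFFF).

    A strict UTF-8 encode of a str fails exactly when such code points
    are present, so the exception is used as the detector.
    """
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


print(has_surrogates('\ud800'))       # True
print(has_surrogates('plain text'))   # False
```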

Fortunately, we can encode the string manually and tell Python how to handle the error. For example, we can replace the erroneous code point with a question mark:

>>> bork.encode('utf-8', errors='replace')
b'?'

The documentation lists other possible options for the errors argument.
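As a rough comparison, the sketch below runs a few of the standard error handlers on the same broken string. The sample string `'a\ud800b'` is chosen only for illustration.

```python
bork = 'a\ud800b'

# Each handler deals with the unencodable surrogate differently.
results = {
    handler: bork.encode('utf-8', errors=handler)
    for handler in ('ignore', 'replace', 'backslashreplace', 'surrogatepass')
}

for handler, encoded in results.items():
    print(handler, encoded)
# ignore b'ab'
# replace b'a?b'
# backslashreplace b'a\\ud800b'
# surrogatepass b'a\xed\xa0\x80b'
```

Note that 'surrogatepass' produces bytes that are not valid UTF-8, so they only round-trip if you decode with the same handler.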

To fix up this broken string, we can encode it (into bytes) and then decode it (back into str):

>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
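Applied to the original problem of reading thousands of files, the same encode/decode round trip could be run over every string in the parsed data, with each file wrapped in its own try/except so that one bad file does not stop the rest. This is only a sketch: `scrub` and `load_all` are hypothetical helper names, and the files are assumed to be `*.json` files in a single directory.

```python
import json
from pathlib import Path


def scrub(value):
    """Recursively replace unencodable code points in parsed JSON data."""
    if isinstance(value, str):
        return value.encode('utf-8', errors='replace').decode('utf-8')
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def load_all(directory):
    """Parse every *.json file in a directory, recording failures instead of stopping."""
    loaded, failed = {}, {}
    for path in sorted(Path(directory).glob('*.json')):
        try:
            with open(path, encoding='utf-8') as handle:
                loaded[path.name] = scrub(json.load(handle))
        except ValueError as exc:
            # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
            failed[path.name] = str(exc)
    return loaded, failed
```

Catching the exception per file is what lets the loop continue. If a different parser raises the ValueError from the question, the same structure should still apply, with that parser's load function swapped in.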
