Dealing with mis-escaped characters in JSON


Question


I am reading a JSON file into Python which contains escaped single quotes (\'). This leads to all kinds of hiccups, as nicely discussed e.g. here. However, I could not find anything on how to address the issue. I just did a

newstring = originalstring.replace(r"\'", "'")

and things worked out. But this seems rather ugly. I could not really find much material on how to deal with this kind of thing (creating an exception, or something) in the json docs either.

  • Is there a good, clean procedure for such an issue?

Going back to the source is not possible, unfortunately.

Thanks for your help!

Answer

The JSON standard defines a specific set of valid 2-character escape sequences: \\, \/, \", \b, \r, \n, \f and \t, and one longer escape sequence for Unicode code points, \uhhhh (\u plus 4 hex digits; code points outside the Basic Multilingual Plane take a surrogate pair of two such escapes). Any other sequence of a backslash plus another character is invalid JSON.
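As a quick check of the list above, the following sketch feeds each escape to Python's json.loads and records which ones the parser accepts. The helper name is_valid_escape and the sample \u00e9 escape are illustrative choices, not part of the original answer.

```python
import json

# The eight 2-character escapes listed above, plus one sample \uhhhh escape.
VALID_ESCAPES = ['\\\\', '\\/', '\\"', '\\b', '\\r', '\\n', '\\f', '\\t', '\\u00e9']

# Escapes that appear in the broken input discussed in this thread.
SUSPECT_ESCAPES = ["\\'", "\\a", "\\&", "\\$"]


def is_valid_escape(escape):
    """Return True if json.loads accepts `escape` inside a JSON string.

    Hypothetical helper for illustration only.
    """
    try:
        json.loads('"' + escape + '"')
    except json.JSONDecodeError:
        return False
    return True


accepted = [e for e in VALID_ESCAPES if is_valid_escape(e)]
rejected = [e for e in SUSPECT_ESCAPES if not is_valid_escape(e)]

print("accepted:", accepted)
print("rejected:", rejected)
```

Every entry of VALID_ESCAPES should land in accepted, while \' and the other suspect escapes are rejected by the parser.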

If you have a JSON source you can't fix otherwise, the only way out is to remove the invalid sequences, as you did with str.replace(), even though that is a little fragile: it breaks when an even number of backslashes precedes the quote (for example \\', where the backslash directly before the quote belongs to a valid escaped backslash).
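To make that fragility concrete, here is a small sketch with a made-up input string. The value is valid JSON to begin with, and the naive replacement turns it into invalid JSON.

```python
import json

# A valid JSON string: an escaped backslash (\\) followed by a plain single quote.
valid = r'''"a \\' b"'''

# The naive fix mistakes the second backslash plus the quote for a broken \' escape.
naive = valid.replace(r"\'", "'")

print(json.loads(valid))  # the original parses fine

try:
    json.loads(naive)
except json.JSONDecodeError as exc:
    # The replacement left a lone backslash in front of the quote.
    print("naive replace broke valid JSON:", exc)
```

So the simple replace is only safe when you know the data never contains an escaped backslash directly before a single quote.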

You could use a regular expression too, where you remove any backslashes not used in a valid sequence:

fixed = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', inputstring)

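This won't catch an odd-count run of backslashes such as \\\ in front of an invalid character (the lookbehind skips every backslash that is preceded by another backslash), but it will catch anything else: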

>>> import re, json
>>> broken = r'"JSON string with escaped quote: \' and various other broken escapes: \a \& \$ and a newline!\n"'
>>> json.loads(broken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 34 (char 33)
>>> json.loads(re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', broken))
"JSON string with escaped quote: ' and various other broken escapes: a & $ and a newline!\n"
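If the odd-count case matters for your data, one possible variant (a sketch, not part of the original answer) is to let the regular expression consume valid escapes as whole units and drop any backslash left over. Because re.sub scans left to right without overlapping matches, each \\ pair is consumed before the backslash that follows it is examined. The name strip_invalid_escapes is hypothetical.

```python
import json
import re

# First alternative: a complete valid escape, captured so it can be kept.
# Second alternative: any other backslash, which gets dropped.
_ESCAPE = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})|\\')


def strip_invalid_escapes(text):
    """Remove backslashes that do not start a valid JSON escape.

    Hypothetical helper; it assumes every backslash in `text` sits inside
    a JSON string, since backslashes are not valid JSON anywhere else.
    """
    return _ESCAPE.sub(lambda m: m.group(0) if m.group(1) else '', text)


broken = r'"odd run: \\\a, escaped quote: \', valid newline: \n"'
fixed = strip_invalid_escapes(broken)

print(fixed)
print(repr(json.loads(fixed)))
```

Here \\\a is read as the valid escape \\ followed by the invalid \a, so only the third backslash is removed. Like the lookaround version, this silently discards data, so it is best kept as a last resort when the source cannot be fixed.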
