Dealing with mis-escaped characters in JSON
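Problem description

I am reading a JSON file into Python that contains escaped single quotes (\'). This leads to all kinds of hiccups, as has been discussed elsewhere. However, I could not find anything on how to address the issue. I just did a

newstring = originalstring.replace(r"\'", "'")

and things worked out. But this seems rather ugly. I could not find much material on how to deal with this kind of thing (creating an exception, or something) in the json docs either.

- Is there a good, clean procedure for this kind of problem?

Going back to the source is not possible, unfortunately.

Thanks for your help!

Answer

The JSON standard defines a specific set of valid 2-character escape sequences: \\, \/, \", \b, \r, \n, \f and \t, plus one longer escape sequence for Unicode code points, \uhhhh (\u followed by 4 hex digits). Any other sequence of a backslash plus another character is invalid JSON.

If you have a JSON source you can't fix otherwise, the only way out is to remove the invalid sequences, as you did with str.replace(), even if that is a little fragile: it will break when an even-length run of backslashes precedes the quote. For example, \\' is an escaped backslash followed by a plain quote, and the replacement would turn it into the invalid \'.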
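To make the fragility concrete, here is a minimal sketch of the failure case. The sample value is made up for illustration; only json.dumps, json.loads and str.replace from the standard library are used.

```python
import json

# json.dumps emits valid JSON; the backslash in the value is escaped as \\,
# so the JSON text contains \\ immediately followed by a single quote.
valid = json.dumps("it\\'s")

# The plain replacement cannot tell that this backslash belongs to the \\ pair,
# so it strips half of a valid escape and leaves an invalid \' behind.
damaged = valid.replace(r"\'", "'")

print(json.loads(valid))  # the original value still round-trips
try:
    json.loads(damaged)
except json.JSONDecodeError as exc:
    print("no longer valid JSON:", exc)
```

If the source never contains a literal backslash right before a single quote, the simple replacement is fine; the sketch only shows why it cannot be trusted in general.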
You could use a regular expression too, removing any backslash that is not part of a valid sequence:

fixed = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', inputstring)

This won't catch out an odd-count backslash sequence like \\\, but it will catch anything else:

>>> import re, json
>>> broken = r'"JSON string with escaped quote: \' and various other broken escapes: \a \& \$ and a newline!\n"'
>>> json.loads(broken)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 34 (char 33)
>>> json.loads(re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', broken))
"JSON string with escaped quote: ' and various other broken escapes: a & $ and a newline!\n"
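If odd-count backslash runs do matter for your data, one possible variation (a sketch, not part of the approach above) is to consume escapes left to right, so that \\ is always treated as one unit before the next backslash is looked at. The names strip_invalid_escapes, _ESCAPE and _VALID are hypothetical helpers introduced here for illustration.

```python
import json
import re

# A backslash plus whatever follows it: a full \uhhhh escape, any single
# character, or nothing at all (a lone backslash at the very end).
_ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|.|$)', re.DOTALL)

# What may legally follow a backslash in JSON.
_VALID = re.compile(r'["\\/bfnrt]|u[0-9a-fA-F]{4}')


def strip_invalid_escapes(text):
    """Drop the backslash from every escape that JSON does not define."""
    def fix(match):
        tail = match.group(1)
        if _VALID.fullmatch(tail):
            return match.group(0)  # valid escape: keep it untouched
        return tail                # invalid escape: keep only the character
    return _ESCAPE.sub(fix, text)


# Three backslashes before the quote: an escaped backslash, then an invalid \'.
broken = r'"escaped quote after a backslash: \\\' and \a \& \$\n"'
print(json.loads(strip_invalid_escapes(broken)))
```

Because matches never overlap, the \\ pair is consumed first and the third backslash is then seen as the start of the invalid \' escape. Like the regular expression above, this works on the whole text rather than only inside JSON strings, which should be harmless since a backslash outside a string is not valid JSON anyway.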