Importing wrongly concatenated JSONs in Python
Question
I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". The document as a whole is not valid JSON, but each individual {...} is.
I currently use the following regular expression to split them:
import re
with open('my_file.txt', 'r') as fp:
    raw_dataset = re.sub('}{', '}\n{', fp.read()).split('\n')
This inserts a line break wherever one curly bracket closes and another opens (}{ -> }\n{), so that I can split the objects onto separate lines.
The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.
For example:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
is parsed as
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way to achieve this for further JSON parsing?
Recommended answer
Use the raw_decode method of json.JSONDecoder:
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple: the first element is the decoded JSON value, and the second is the index in the string at which that JSON document ended, i.e. the offset of the first character after it.
To loop, starting again from the full original string in x, until the end is reached or an invalid JSON element is encountered:
>>> while True:
...     try:
...         j, n = d.raw_decode(x)
...     except ValueError:
...         break
...     print(j)
...     x = x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspecting x reveals whether the whole string was processed (x is empty) or a JSON syntax error was encountered (x holds the unparsed remainder).
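The loop above re-slices the string after every element, which copies the remainder each time. As a rough sketch of an alternative, raw_decode also accepts a start index as its second argument, so the position can be advanced instead. The function name iter_concatenated_json is made up for illustration, and the sample string is a tidied version of the one in the question. Unlike the loop above, this version lets the ValueError propagate rather than stopping silently.

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Decode one value starting at pos; pos becomes the index just past it.
        # Invalid input raises json.JSONDecodeError (a ValueError subclass).
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

data = '{"name": "Bob Dylan", "tags": "{Artist}{Singer}"}{"name": "Michael Jackson"}'
records = list(iter_concatenated_json(data))
```

Here records holds one dict per concatenated object, and the braces inside the tags string are left alone because the decoder, not a regular expression, decides where each object ends.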
For a very long file of short elements, you might read a chunk into a buffer and apply the loop above, then prepend whatever is left over after the loop breaks to the next chunk.
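One possible way to write that chunked approach is sketched below. The name iter_json_stream and the chunk_size default are illustrative choices, not part of the original answer. The sketch assumes every top-level element is an object or array, since a bare number split across two chunks could be decoded prematurely. It also treats any decode error before end of file as "element not complete yet", so a truly malformed element is only reported once the file has been read to the end.

```python
import io
import json

def iter_json_stream(fp, chunk_size=65536):
    """Yield JSON values from a text file object holding concatenated JSON."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)  # empty string means end of file
        buffer += chunk
        pos = 0
        while True:
            # Skip whitespace between elements; raw_decode will not do it.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except ValueError:
                if not chunk:
                    raise  # nothing more to read: the leftover is invalid
                break      # probably cut off mid-element; read more data
            yield obj
        # Keep only the unparsed remainder for the next round.
        buffer = buffer[pos:]
        if not chunk:
            break

data = '{"name": "Bob Dylan", "tags": "{Artist}{Singer}"}{"name": "Michael Jackson"}'
# A tiny chunk size forces elements to be split across reads.
streamed = list(iter_json_stream(io.StringIO(data), chunk_size=7))
```

With a real file, the same generator could be driven by passing the object returned by open('my_file.txt', 'r') in place of the io.StringIO used here.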