Importing wrongly concatenated JSONs in Python

Question

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". The file as a whole is not valid JSON, but each {...} is.

I currently use the following regular expression to split them:

import re
# Insert a newline between every "}{" pair, then split on newlines.
with open('my_file.txt', 'r') as fp:
    raw_dataset = re.sub('}{', '}\n{', fp.read()).split('\n')

This inserts a line break wherever one curly bracket closes and another opens (}{ -> }\n{), so I can split the objects onto separate lines.

The problem is that a few of them have a tags attribute written as "{tagName1}{tagName2}", which breaks my regular expression.

For example:

'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'

is parsed as

'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'

instead of

'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
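The failure can be reproduced with a short script. This is only a sketch of the behaviour described above; the variable names `text` and `pieces` are illustrative and do not come from the original post, and the braces are escaped in the pattern purely for readability.

```python
import re

text = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'

# Same substitution as in the question: every "}{" becomes "}\n{",
# including the one inside the "tags" string value.
pieces = re.sub(r'\}\{', '}\n{', text).split('\n')

for piece in pieces:
    print(piece)
```

The split yields three pieces rather than two, and the first two are not valid JSON on their own.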

What is the proper way to achieve this for further JSON parsing?

Recommended answer

Use the raw_decode method of json.JSONDecoder:

>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)

raw_decode returns a 2-tuple: the first element is the decoded JSON value, and the second is the offset in the string of the first character after the end of that JSON document.
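As a possible alternative to slicing the string after each call, raw_decode also accepts a start index as its second argument. The sketch below assumes the same sample string as above; the names `decoder`, `text`, `first`, `second`, `end1` and `end2` are illustrative.

```python
import json

decoder = json.JSONDecoder()
text = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'

# raw_decode(s, idx) starts parsing at index idx and returns (value, end),
# where end is the index in s just past the parsed value.
first, end1 = decoder.raw_decode(text)
second, end2 = decoder.raw_decode(text, end1)

print(first, end1)
print(second, end2)
```

Note that with a start index the returned offset is relative to the whole string, not to a slice, so `end2` here equals `len(text)` rather than 27.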

To loop until the end or until an invalid JSON element is encountered:

>>> while True:
...   try:
...     j,n = d.raw_decode(x)
...   except ValueError:
...     break
...   print(j)
...   x=x[n:]
... 
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}

When the loop breaks, inspecting x reveals whether the whole string was processed (x is empty) or a JSON syntax error was encountered (x holds the unparsed remainder).
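The loop and the final inspection can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: `decode_concatenated` is a hypothetical name, and it tracks a position index rather than re-slicing the string.

```python
import json

def decode_concatenated(text):
    """Return (objects, remainder) for a string of back-to-back JSON documents.

    remainder is '' if everything was parsed; otherwise it is the
    part of the input starting at the first position that failed to parse.
    """
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            break
        objects.append(obj)
    return objects, text[pos:]

objects, remainder = decode_concatenated('{"a": 1}{"b": 2}{"c": ')
print(objects)
print(repr(remainder))
```

One caveat worth knowing: raw_decode does not skip leading whitespace, so a space or newline between two documents also stops the loop and shows up at the start of the remainder.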

With a very long file of short elements, you might read a chunk into a buffer and apply the loop above, then prepend whatever is left over to the next chunk once the loop breaks.
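The chunked approach might look roughly like the following. It is a sketch under stated assumptions: `iter_json_objects` and `chunk_size` are hypothetical names, the stream is opened in text mode, and every document is a JSON object or array (a bare number split across a chunk boundary could be misread as two numbers).

```python
import io
import json

def iter_json_objects(stream, chunk_size=65536):
    """Yield JSON values from a text stream of concatenated documents."""
    decoder = json.JSONDecoder()
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        pos = 0
        while pos < len(buffer):
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except ValueError:
                # Either the document is cut off at the chunk boundary
                # or the data is invalid; wait for more input to decide.
                break
            yield obj
        # Keep only the unparsed tail and join it with the next chunk.
        buffer = buffer[pos:]
    if buffer:
        # Input ended with data that never became a complete document.
        raise ValueError('unparsed data at end of stream: %r' % buffer[:50])

# Usage with an in-memory stream and a deliberately tiny chunk size.
sample = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
for obj in iter_json_objects(io.StringIO(sample), chunk_size=5):
    print(obj)
```

Because a failed parse is retried from the start of the leftover buffer after each new chunk, this works best when documents are small relative to the chunk size, as the answer assumes.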
