Retrieving JSON objects from a text file (using Python)


Problem description

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:

{field1: {}, field2: "some value", field3: {}, ...} 

and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().

Any suggestions on how I can solve this problem? Is there a known parser that can do this?
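
To make the failure concrete, here is a minimal sketch, assuming Python 3 and a made-up two-object string (`concatenated` is purely illustrative, not data from the files in question). `json.loads()` parses the first object and then rejects whatever follows it:

```python
import json

# Hypothetical sample: two JSON objects glued together with no delimiter.
concatenated = '{"a": 1}{"b": 2}'

try:
    json.loads(concatenated)
    error_message = None
except ValueError as exc:
    # On Python 3 this is a json.JSONDecodeError, a subclass of ValueError.
    error_message = str(exc)

# The message reports "Extra data" at the start of the second object.
print(error_message)
```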

Solution

This decodes your "list" of JSON Objects from a string:

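from json import JSONDecoder

def loads_invalid_obj_list(s):
    # Decode back-to-back JSON values from s and return them as a list.
    # Note: whitespace between two values is not skipped by raw_decode.
    decoder = JSONDecoder()
    s_len = len(s)

    objs = []
    end = 0
    while end != s_len:
        # raw_decode returns the decoded value and the index just past it
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)

    return objs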

The bonus here is that you play nicely with the parser, so it still tells you exactly where it found an error.

Examples

>>> loads_invalid_obj_list('{}{}')
[{}, {}]

>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "decode.py", line 9, in loads_invalid_obj_list
    obj, end = decoder.raw_decode(s, idx=end)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
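
The traceback above comes from Python 2.7. As a hedged sketch for Python 3 only: there the decoder raises `json.JSONDecodeError`, which exposes the error location as the attributes `pos`, `lineno` and `colno`, so the position can be read programmatically instead of being parsed out of the message. The helper name `find_decode_error` is an assumption made for this illustration and is not part of the answer's code.

```python
import json
from json import JSONDecoder


def find_decode_error(s):
    """Decode concatenated JSON values; return (objs, error).

    error is None on success, otherwise the json.JSONDecodeError raised
    while decoding; objs holds whatever was decoded before the failure.
    """
    decoder = JSONDecoder()
    objs = []
    end = 0
    try:
        while end != len(s):
            obj, end = decoder.raw_decode(s, idx=end)
            objs.append(obj)
    except json.JSONDecodeError as exc:
        return objs, exc
    return objs, None


# Same malformed input as in the example above.
objs, err = find_decode_error('{}{\n}{')
if err is not None:
    # pos is a character offset into the string; lineno/colno are 1-based.
    print(len(objs), err.pos, err.lineno, err.colno)
```

The exact message and offset differ between Python versions, so it is safer to rely on the attributes than on the wording of the error text.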

Clean Solution (added later)

import json
import re

# shameless copy-paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)

        objs = []
        end = 0
        while end != s_len:
            # skip leading whitespace, then decode one value
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            # skip trailing whitespace so the loop can stop at the end of s
            end = _w(s, end).end()
            objs.append(obj)
        return objs

Examples

>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]

>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]

>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "decode.py", line 15, in decode
    obj, end = self.raw_decode(s, idx=_w(s, end).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
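
The examples above were run on Python 2.7. For Python 3, the same `raw_decode` technique can be sketched as a generator that yields one object at a time, which avoids building the whole list when a file holds hundreds of objects. This is an illustrative variant rather than the answer's own code: the function name `iter_concatenated_json` and the `sample` string are assumptions.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value found back-to-back in text.

    Whitespace before, between and after the values is skipped.
    Malformed input raises json.JSONDecodeError from raw_decode.
    """
    decoder = json.JSONDecoder()
    idx = 0
    n = len(text)
    while True:
        # Skip the four whitespace characters that JSON allows.
        while idx < n and text[idx] in ' \t\n\r':
            idx += 1
        if idx >= n:
            return
        # raw_decode returns the value and the index just past it.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


# Hypothetical sample with nested objects and mixed whitespace.
sample = '{"a": 1} {"b": {"c": []}}\n{"d": "x"}  '
print(list(iter_concatenated_json(sample)))
```

Unlike the `decode` override, this version also returns an empty result for an empty or whitespace-only string instead of raising.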
