How to incrementally write into a json file


Problem Description

I am writing a program that requires me to generate a very large JSON file. I know the traditional way is to dump a list of dictionaries using json.dump(), but the list has grown so big that even the total memory plus swap space cannot hold it before it is dumped. Is there any way to stream it into a JSON file, i.e., write the data into the JSON file incrementally?

Recommended Answer

I know this is a year late, but the issue is still open, and I'm surprised json.iterencode() has not been mentioned.

The potential problem with iterencode in this example is that you would want to iterate over the large data set with a generator, and the json encoder does not serialize generators.
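For context, here is a minimal sketch of that failure mode (assuming Python 3; the toy payload is invented for illustration):

import json

# Passing a generator straight to the encoder raises a TypeError:
# the default JSONEncoder has no handler for generator objects.
gen = ({'hello_world': i} for i in range(3))
try:
    json.dumps(gen)
except TypeError as err:
    print(err)  # e.g. "Object of type generator is not JSON serializable"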

The way around this is to subclass the list type and override the __iter__ magic method so that it yields the output of your generator.

Here is an example of such a list subclass.

class StreamArray(list):
    """
    Converts a generator into a list object that can be json serialised
    while still retaining the iterative nature of a generator.

    i.e. it converts the generator to a list without having to exhaust
    the generator and keep its contents in memory.
    """
    def __init__(self, generator):
        self.generator = generator
        # Start at 1 so the encoder does not treat the array as empty
        # and short-circuit to '[]' before __iter__ has run.
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        """
        The json encoder checks this method to decide whether the list
        is empty, so it must report a non-zero length until the
        generator has actually been exhausted.
        """
        return self._len
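One caveat worth flagging: len() on a StreamArray is only accurate once iteration has completed, since the count is tallied as items are yielded; until then it reports the placeholder value of 1. For this use that appears to be fine, because the encoder only consults the length to decide whether the array is empty.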

The usage from here on is quite simple: get the generator handle, pass it into the StreamArray class, pass the stream array object into iterencode(), and iterate over the chunks. The chunks will be JSON-formatted output that can be written directly to file.

Example usage:

import json

# Function that will iteratively generate a large set of data.
def large_list_generator_func():
    for i in range(5):
        chunk = {'hello_world': i}
        print('Yielding chunk: ', chunk)
        yield chunk

# Write the contents to file:
with open('/tmp/streamed_write.json', 'w') as outfile:
    large_generator_handle = large_list_generator_func()
    stream_array = StreamArray(large_generator_handle)
    for chunk in json.JSONEncoder().iterencode(stream_array):
        print('Writing chunk: ', chunk)
        outfile.write(chunk)

The output shows that yields and writes are interleaved: each record is encoded and written as soon as the generator produces it.

Yielding chunk:  {'hello_world': 0}
Writing chunk:  [
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  : 
Writing chunk:  0
Writing chunk:  }
Yielding chunk:  {'hello_world': 1}
Writing chunk:  , 
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  : 
Writing chunk:  1
Writing chunk:  }
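Because only one encoded chunk is in memory at a time, peak memory stays proportional to a single record rather than the whole list. If the output does not strictly need to be a single JSON array, a common alternative (not part of the answer above, just a sketch using the same toy records) is JSON Lines, with one document per line:

import json

# Alternative sketch: JSON Lines output, one record per line.
# Each record is encoded independently, so memory stays bounded by
# one record, and the file can later be read back line by line.
with open('/tmp/streamed_write.jsonl', 'w') as outfile:
    for i in range(5):
        outfile.write(json.dumps({'hello_world': i}) + '\n')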
