Writing a Large JSON Array To File

Problem Description

I am working on a process that will likely end up attempting to serialize very large JSON arrays to file. So loading the entire array into memory and just dumping it to file won't work. I need to stream the individual items to the file to avoid out-of-memory issues.

Surprisingly, I can't find any examples of doing this. The code snippet below is something I've cobbled together. Is there a better way to do this?

import json

first_item = True
with open('big_json_array.json', 'w') as out:
    out.write('[')
    for item in some_very_big_iterator:
        if first_item:
            out.write(json.dumps(item))
            first_item = False
        else:
            out.write("," + json.dumps(item))
    out.write("]")

Solution

While your code is reasonable, it can be improved upon. You have two reasonable options, and an additional suggestion.

Your options are:

  • To not generate an array, but to generate JSON Lines output. For each item in your generator this writes a single valid JSON document, without embedded newlines, into the file, followed by a newline. This is easy to generate with a default json.dump(item, out) call followed by out.write('\n'). You end up with a file with a separate JSON document on each line.

The advantage is that you don't have to worry about memory issues when writing or when reading the data back. Otherwise you'd have bigger problems reading the data from the file later on: the json module can't be made to load data iteratively, not without manually skipping the initial [ and the commas.

Reading JSON Lines is simple; see Loading and parsing a JSON file with multiple JSON objects in Python

  • Wrap your data in a generator and a list subclass to make json.dump() accept it as a list, then write the data iteratively. I'll outline this below. Take into account that you may now have to solve the problem in reverse, reading the JSON data back with Python.
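
The first option (JSON Lines) can be sketched as follows. This is a minimal, self-contained sketch: the file name and the generator of squares are just illustrative stand-ins for your real data source.

```python
import json

# Stand-in for the real data source; any lazy iterator works here.
items = ({"id": i, "value": i * i} for i in range(1000))

# Write: one JSON document per line; only the current item is ever in memory.
with open('big_json_array.jsonl', 'w') as out:
    for item in items:
        json.dump(item, out)
        out.write('\n')

# Read: iterating the file object yields one line at a time,
# so memory use stays flat regardless of file size.
with open('big_json_array.jsonl') as f:
    total = sum(json.loads(line)['value'] for line in f)
```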

My suggestion is to not use JSON here. JSON was never designed for large-scale data sets; it's a web interchange format aimed at much smaller payloads. There are better formats for this sort of data exchange. If you can't deviate from JSON, then at least use JSON Lines.

You can write a JSON array iteratively using a generator, and a list subclass to fool the json library into accepting it as a list:

import json

class IteratorAsList(list):
    def __init__(self, it):
        self.it = it
    def __iter__(self):
        return self.it
    def __len__(self):
        return 1  # any non-zero value; the encoder only checks truthiness

with open('big_json_array.json', 'w') as out:
    json.dump(IteratorAsList(some_very_big_iterator), out)

The IteratorAsList class satisfies two tests that the json encoder makes: that the object is a list or subclass thereof, and that it has a length greater than 0; when those conditions are met it'll iterate over the list (using __iter__) and encode each object.

The json.dump() function then writes to the file as the encoder yields data chunks; it'll never hold all the output in memory.
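
For completeness, here is a self-contained round trip of this approach. The class definition is repeated so the snippet runs on its own, and the five-element generator of squares is just a stand-in for a large stream:

```python
import json

# Same trick as above: a list subclass that defers iteration to a generator.
class IteratorAsList(list):
    def __init__(self, it):
        self.it = it
    def __iter__(self):
        return self.it
    def __len__(self):
        return 1  # any non-zero value; the encoder only checks truthiness

# A generator: the full sequence is never materialized as a real list.
numbers = (n * n for n in range(5))

# json.dump() (not dumps()) uses the iterative Python encoder,
# so the output is streamed chunk by chunk.
with open('big_json_array.json', 'w') as out:
    json.dump(IteratorAsList(numbers), out)

# Reading it back confirms an ordinary JSON array was written.
with open('big_json_array.json') as f:
    data = json.load(f)
```

Note that this only works with json.dump(); json.dumps() takes the one-shot C-encoder path, which bypasses __iter__.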
