Reading rather large json files in Python


Problem description

Possible Duplicate:
Is there a memory efficient and fast way to load big json files in python?

So I have some rather large JSON-encoded files. The smallest is 300 MB, and that is by far the smallest; the rest are multiple gigabytes, anywhere from around 2 GB to 10 GB+.

So I seem to run out of memory when trying to load the file with Python. I'm currently running some tests to get a rough sense of how long dealing with these files will take, so I can decide where to go from here. Here is the code I'm using to test:

from datetime import datetime
import json

print(datetime.now())

# Attempt to parse the whole file in one go.
with open('file.json', 'r') as f:
    data = json.load(f)

print(datetime.now())

Not too surprisingly, Python gives me a MemoryError. It appears that json.load() calls json.loads(f.read()), which tries to read the entire file into memory first; that clearly isn't going to work.

Any way I can solve this cleanly?

I know this is old, but I don't think this is a duplicate. While the answer is the same, the question is different. In the "duplicate", the question is how to read large files efficiently, whereas this question deals with files that won't even fit into memory at all. Efficiency isn't required.

Answer

The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.

The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.
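
To make the idea concrete, here is a minimal sketch using the ijson module recommended below (assumptions: ijson is installed via pip install ijson, and 'file.json' is the same placeholder as in the question). A streaming parser hands back one parse event at a time, so only a small read buffer is ever held in memory:

import ijson

# Iterate over parser events one at a time; the whole document is
# never materialized, only ijson's small internal read buffer.
with open('file.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        # prefix: dotted path of the current node (e.g. 'users.item.name')
        # event:  parser event name ('start_map', 'map_key', 'string', ...)
        # value:  the scalar for value events, otherwise None
        if event == 'string':
            print(prefix, value)  # placeholder: act on one value at a time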

The best option appears to be using something like ijson - a module that will work with JSON as a stream, rather than as a block file.
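
For example, assuming the top level of the file is a JSON array of records (an assumption about the data's shape; 'file.json' is again a placeholder), ijson.items() can build and hand over one element at a time:

import ijson

with open('file.json', 'rb') as f:
    # 'item' is ijson's prefix for each element of a top-level array;
    # every record is parsed, yielded, and can then be garbage-collected.
    for record in ijson.items(f, 'item'):
        print(record)  # placeholder for real per-record processing

Memory use then stays roughly proportional to the largest single record rather than to the whole file.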

Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.
