MemoryError when loading a JSON file


Problem description

Python (and Spyder) return a MemoryError when I load a JSON file that is 500MB in size.

But my computer has 32GB of RAM, and the "memory" displayed by Spyder only goes from 15% to 19% when I try to load it! It seems that I should have much more space...

Is there something I am missing?

Answer

500MB of JSON data does not result in 500MB of memory usage. It will result in a multiple of that. Exactly what factor depends on the data, but a factor of 10-25 is not uncommon.

For example, the following simple JSON string of 14 characters (bytes on disk) results in a Python object that is almost 25 times larger (Python 3.6b3):

>>> import json
>>> from sys import getsizeof
>>> j = '{"foo": "bar"}'
>>> len(j)
14
>>> p = json.loads(j)
>>> getsizeof(p) + sum(getsizeof(k) + getsizeof(v) for k, v in p.items())
344
>>> 344 / 14
24.571428571428573

That's because Python objects require some overhead; each instance tracks the number of references to it, its type, and its attributes (if the type supports attributes) or its contents (in the case of containers).
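
To put rough numbers on that overhead, even tiny scalars carry dozens of bytes of object header on top of their payload. The figures below are from a 64-bit CPython 3.x build and vary somewhat by version and platform:

>>> from sys import getsizeof
>>> getsizeof(1)       # one small integer: ~4 bytes of payload, the rest is header
28
>>> getsizeof('bar')   # three ASCII characters of payload
52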

If you use the built-in json library to load that file, it has to build larger and larger objects from the contents as they are parsed, and at some point your OS will refuse to provide more memory. That won't be at 32GB, because there is a per-process limit on how much memory can be used, so it is more likely to happen around 4GB. At that point all the objects created so far are freed again, so in the end the observed memory use doesn't have to change much.
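
If you want to check whether such a cap could apply on your machine, a quick sketch is to look at the interpreter's pointer size and (on Unix; the resource module is not available on Windows) the address-space limit. The outputs shown are just one possible result:

>>> import sys, resource
>>> sys.maxsize > 2**32                     # False would mean a 32-bit interpreter (~4GB address space)
True
>>> resource.getrlimit(resource.RLIMIT_AS)  # (soft, hard) address-space limit in bytes; -1 means unlimited
(-1, -1)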

The solution is to either break up that large JSON file into smaller subsets, or to use an event-driven JSON parser such as ijson.

An event-driven JSON parser doesn't create Python objects for the whole file, only for the item currently being parsed, and it notifies your code of each item with an event ('starting an array', 'here is a string', 'now starting a mapping', 'this is the end of the mapping', and so on). You can then decide which data you need to keep and which to ignore. Anything you ignore is discarded again, and memory use stays low.
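
As a minimal sketch of the ijson approach, assuming a hypothetical file big.json whose top level is a JSON array of objects and a hypothetical field "name" that you want to keep (adjust the prefix and field names to your actual structure):

import ijson

names = []
with open('big.json', 'rb') as f:
    # ijson.items() yields one fully built Python object per array element
    # ('item' addresses the elements of a top-level array), so only one
    # element lives in memory at a time.
    for obj in ijson.items(f, 'item'):
        names.append(obj.get('name'))  # keep only what you need; the rest is discarded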

