Python very large set. How to avoid out of memory exception?


Problem description

I use a Python set to store unique objects. Every object overrides __hash__ and __eq__.

The set contains nearly 200 000 objects and by itself takes almost 4 GB of memory. It works fine on a machine with more than 5 GB, but now I need to run the script on a machine that has only 3 GB of RAM available.

I rewrote the script in C# - it reads the same data from the same source and puts it into the CLR analogue of a set (HashSet) - and instead of 4 GB it took about 350 MB, while execution speed stayed roughly the same (about 40 seconds). But I have to use Python.

Q1: Does Python have any "disk-persistent" set, or any other workaround? I guess it could keep in memory only the "key" data used by the hash/eq methods, while everything else is persisted to disk. Or maybe there are other workarounds in Python for keeping a unique collection of objects that takes more memory than is available in the system.
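There is no built-in disk-persistent set, but the idea described in Q1 can be approximated with the standard library: keep only the identifying key in an in-memory set and push the full record into a shelve file on disk. A minimal sketch, assuming every object can be reduced to a picklable record identified by a single id (both assumptions of mine, not details of the actual script):

import shelve

# Sketch of the "keep only the key in memory" idea from Q1 (not a
# built-in feature). Only the ids used for equality stay in RAM; the
# full, bulky records are pickled to disk by shelve.
seen_ids = set()                       # small in-memory part
store = shelve.open("objects.shelf")   # hypothetical on-disk file

def add_unique(obj_id, record):
    """Persist record on disk unless an object with obj_id was already seen."""
    if obj_id not in seen_ids:
        seen_ids.add(obj_id)
        store[str(obj_id)] = record    # shelve keys must be strings

add_unique(603, {"title": "The Matrix", "genres": ["Action", "Sci-Fi"]})
add_unique(603, {"title": "The Matrix"})   # duplicate id, ignored
store.close()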

Q2: A less practical question: why does a Python set take so much more memory?

I use standard Python 2.7.3 on 64-bit Ubuntu 12.10.

Thank you.

Update 1: What the script does:

  1. Read a lot of semi-structured JSON documents (each JSON document consists of a serialized object plus a collection of aggregated objects related to it).

  2. Parse each JSON document to retrieve the main object and the objects from the aggregated collections. Every parsed object is stored in a set; the set is used to keep unique objects only. At first I used a database, but a unique constraint in the database works 100-1000x slower. Every JSON document is parsed into 1-8 different object types, and each object type is stored in its own set so that only unique objects are kept in memory (a sketch of this flow follows below).

  3. All data stored in the sets is saved to a relational database with unique constraints; each set is stored in a separate database table.

The whole idea of the script is to take unstructured data, remove duplicates from the aggregated object collections in the JSON documents, and store the structured data in a relational database.
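A minimal, self-contained sketch of that flow, with a made-up document layout and simplified stand-in types (namedtuples instead of the real classes, and only two of the 1-8 object types):

import json
from collections import defaultdict, namedtuple

# Hypothetical, simplified stand-ins for the real object types.
Movie = namedtuple("Movie", ["id", "title"])
Actor = namedtuple("Actor", ["id", "name"])

# One set per object type, so only unique objects are kept in memory.
unique_objects = defaultdict(set)

def process_document(raw_json):
    """Parse one document: the main object plus its aggregated objects."""
    doc = json.loads(raw_json)
    unique_objects["movie"].add(Movie(doc["id"], doc["title"]))
    for a in doc.get("actors", []):
        unique_objects["actor"].add(Actor(a["id"], a["name"]))

docs = [
    '{"id": 1, "title": "A", "actors": [{"id": 7, "name": "X"}]}',
    '{"id": 2, "title": "B", "actors": [{"id": 7, "name": "X"}]}',
]
for raw in docs:
    process_document(raw)
print(len(unique_objects["actor"]))   # 1: the duplicate actor was dropped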

Update 2:

To delnan: I commented out all the lines of code that add to the different sets, keeping everything else (getting the data, parsing, iterating) the same - the script took 4 GB less memory.

It means that when those 200K objects are added to the sets, they start taking up that much memory. Each object is simple movie data from TMDB - an ID, a list of genres, a list of actors, a list of directors, a lot of other movie details, and possibly a large movie description from Wikipedia.
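For illustration, a hypothetical reconstruction of such an object; the choice to base __hash__ and __eq__ on the TMDB ID alone is an assumption, not something stated above:

class Movie(object):
    """Hypothetical shape of the stored object described above."""

    def __init__(self, tmdb_id, title, genres, actors, directors, description):
        self.tmdb_id = tmdb_id          # assumed to define identity
        self.title = title
        self.genres = genres            # bulky payload fields below
        self.actors = actors
        self.directors = directors
        self.description = description

    def __hash__(self):
        return hash(self.tmdb_id)

    def __eq__(self, other):
        return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id

    def __ne__(self, other):            # Python 2 does not derive this from __eq__
        return not self.__eq__(other)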

Solution

Sets indeed use a lot of memory, but lists don't.

>>> from sys import getsizeof
>>> a = range(100)
>>> b = set(a)
>>> getsizeof(a)
872
>>> getsizeof(b)
8424
>>>

If the only reason why you use a set is to prevent duplicates, I would advise you to use a list instead. You can prevent duplicates by testing if objects are already in your list before adding them. It might be slower than using the built-in mechanics of sets, but it would surely use a lot less memory.
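A short sketch of that suggestion (the helper name is made up): membership testing on a list is a linear scan through __eq__, so inserts get slower as the list grows, but there is no hash table kept in memory.

unique_movies = []

def add_if_new(obj, collection):
    """Append obj unless an equal object is already in the list."""
    if obj not in collection:       # linear scan using __eq__
        collection.append(obj)

add_if_new("The Matrix", unique_movies)
add_if_new("The Matrix", unique_movies)   # duplicate, skipped
print(unique_movies)                      # ['The Matrix']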
