Python MemoryError creating large dictionary

Question

I am trying to process a 3 GB XML file, and am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.

from xml.etree import cElementTree  # on Python 3.3+, plain ElementTree uses the same C parser

class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0


nodes = {}  # osm_id -> Node, kept for the later link lookups

# raw_osm_file: the 3 GB OSM XML file, opened/defined earlier in the script
context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = next(context)  # grab the root element so it can be cleared as we go

for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()  # discard already-processed elements so the tree stays small

I'm using an iterative parsing method so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check if the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.
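
A minimal sketch of that later membership check, assuming the links are OSM <way> elements whose <nd ref="..."/> children reference node IDs, and that nodes is the dict built above (the question doesn't show this part, so the way/nd element names and the filter_ways helper are only illustrative):

from xml.etree import ElementTree  # since Python 3.3 this uses the same C parser as cElementTree

def filter_ways(raw_osm_file, nodes):
    # Hypothetical second iterparse pass: keep only the links whose
    # referenced nodes were all seen in the first pass.
    context = iter(ElementTree.iterparse(raw_osm_file, events=("start", "end")))
    event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "way":
            refs = [int(nd.get('ref')) for nd in elem.findall('nd')]
            if all(ref in nodes for ref in refs):  # O(1) dict lookup per reference
                yield int(elem.get('id')), refs
            root.clear()  # discard processed elements to keep memory flat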

How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving bits and pieces off won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python tops out at about 1950 MB, while my computer still has about 6 GB of RAM available.

Answer

Assuming you have tons of Node instances being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (at the cost of not being able to create undeclared attributes) and can easily cut the memory usage per Node by a factor of ~5 (less on Python 3.3+, where shared-key __dict__s already reduce the per-instance memory cost for free).

It's easy to do, just change the declaration of Node to:

class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

For example, on Python 3.5 (where shared key dictionaries already save you something), the difference in object overhead can be seen with:

>>> import sys
>>> ... define Node without __slots__
>>> n = Node(1, 2, 3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> ... define Node with __slots__
>>> n = Node(1, 2, 3)
>>> sys.getsizeof(n)  # It has no __dict__ now
72
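
The same comparison as a standalone script, in case you want to reproduce it outside an interactive session; PlainNode and SlottedNode are just illustrative renames of the two Node variants above, and the exact byte counts will vary between Python versions and builds:

import sys

class PlainNode(object):
    # Illustrative rename of the original Node: attributes live in a per-instance __dict__.
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

class SlottedNode(object):
    # Illustrative rename of the __slots__ version: fixed attribute set, no per-instance __dict__.
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

plain = PlainNode(1, 2, 3)
slotted = SlottedNode(1, 2, 3)
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))  # instance plus its attribute dict
print(sys.getsizeof(slotted))                                # no __dict__ to add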

And remember, this is Python 3.5 with shared-key dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer-sized variable larger, IIRC), while the cost without __slots__ would go up by a few hundred bytes.

Also, assuming you're on a 64-bit OS, make sure you've installed a 64-bit build of Python to match; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
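
A quick way to confirm which build you are actually running (a small check using only the standard library; either line on its own is enough):

import struct
import sys

print(struct.calcsize("P") * 8)  # pointer size in bits: 64 on a 64-bit build, 32 on a 32-bit build
print(sys.maxsize > 2 ** 32)     # True on a 64-bit build, False on a 32-bit build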
