Python MemoryError creating large dictionary
Question
I am trying to process a 3 GB XML file, and am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.
import xml.etree.cElementTree as cElementTree

class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

nodes = {}

context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = context.next()  # Python 2; on Python 3 use next(context)

for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()
I'm using an iterative parsing method, so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check whether the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.
How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving bits and pieces off won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python is topping out at about 1950 MB, and my computer still has about 6 GB of RAM available.
Answer
Assuming you have tons of Node instances being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (in exchange for preventing the creation of undeclared attributes) and can easily cut memory usage per Node by a factor of ~5x (less on Python 3.3+, where key-sharing __dict__ reduces the per-instance memory cost for free).
It's easy to do; just change the declaration of Node to:
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0
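To make the trade-off mentioned above concrete: with __slots__ in place, assigning an attribute that isn't declared raises AttributeError instead of silently creating it. A minimal sketch (the stray `name` attribute is just an illustration, not from the original question):

```python
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

n = Node(1, 2, 3)
try:
    n.name = "somewhere"   # not listed in __slots__
except AttributeError as e:
    print("rejected:", e)  # no per-instance __dict__ to store it in
```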
For example, on Python 3.5 (where key-sharing dictionaries already save you something), the difference in object overhead can be seen with:
>>> import sys
>>> ... define Node without __slots__
>>> n = Node(1, 2, 3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> ... define Node with __slots__
>>> n = Node(1, 2, 3)
>>> sys.getsizeof(n)  # it has no __dict__ now
72
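The interactive session above can be reproduced as a self-contained script. The class names NodeDict and NodeSlots are just labels for this comparison; the exact byte counts vary by Python version, but the slotted version is consistently much smaller:

```python
import sys

class NodeDict(object):
    """Node without __slots__; each instance carries a __dict__."""
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

class NodeSlots(object):
    """Same attributes, stored in fixed slots instead of a dict."""
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

plain = NodeDict(1, 2, 3)
slotted = NodeSlots(1, 2, 3)

# The plain instance's true footprint includes its attribute dict.
size_plain = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
size_slotted = sys.getsizeof(slotted)  # no __dict__ to add

print("without __slots__:", size_plain, "bytes")
print("with __slots__:   ", size_slotted, "bytes")
```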
And remember, this is Python 3.5 with key-sharing dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer-sized variable larger, IIRC), while the cost without __slots__ would go up by a few hundred bytes.
Also, assuming you're on a 64-bit OS, make sure you've installed the 64-bit version of Python to match the 64-bit OS; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
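A quick way to check which build you're running (a standard check, not from the original answer): on a 64-bit Python build, sys.maxsize is 2**63 - 1 and C pointers are 8 bytes.

```python
import struct
import sys

# True on a 64-bit Python build, regardless of the OS it runs on.
is_64bit = sys.maxsize > 2**32
pointer_bytes = struct.calcsize("P")  # size of a C pointer in this build

print("64-bit build:", is_64bit)
print("pointer size:", pointer_bytes, "bytes")
```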