memory overflow when using numpy load in a loop
Problem description
Looping over npz file loads causes a memory overflow (depending on the length of the file list).
None of the following seems to help (see the sketch after this list):

- Deleting the variable which stores the data in the file.
- Using mmap.
- Calling gc.collect() (garbage collection).
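For concreteness, here is a minimal sketch of those attempts (the mmap variant is omitted, and the tmp.npz demo file from the code below is assumed); on the affected setup the memory still grows:

import gc
import numpy as np

for i in xrange(1000000):
    data = np.load('tmp.npz')
    X = data['X']
    data.close()    # avoid the "too many files are open" error
    del X
    del data        # deleting the variables that reference the data does not help
    gc.collect()    # forcing garbage collection does not reclaim the memory either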
The following code should reproduce the phenomenon:
import numpy as np
# generate a file for the demo
X = np.random.randn(1000,1000)
np.savez('tmp.npz',X=X)
# here comes the overflow:
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error
In my real application the loop is over a list of files, and the overflow exceeds 24GB of RAM! Please note that this was tried on Ubuntu 11.10, with both numpy v1.5.1 and v1.6.0.
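One way to watch the growth without waiting for 24GB to fill up is to print the process's peak resident set size every few thousand iterations; a rough sketch using the standard resource module (on Linux, ru_maxrss is reported in kilobytes):

import resource
import numpy as np

for i in xrange(100000):
    data = np.load('tmp.npz')
    data.close()
    if i % 10000 == 0:
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print i, rss   # peak RSS keeps increasing on the affected versions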
I have filed a report as numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure that this is a bug; it may be the result of my own bad programming).
The command

del data.f

should precede the command

data.close()

For more information, and a method for finding the solution, please read HYRY's kind answer below.
Answer
I think this is a bug, and maybe I found the solution: call del data.f.
for i in xrange(10000000):
    data = np.load('tmp.npz')
    del data.f
    data.close()  # avoid the "too many files are open" error
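If the arrays are actually needed inside the loop, one would presumably copy out whatever is required before breaking the cycle; a small sketch along those lines (the X key and the running sum are just illustrative):

total = 0.0
for i in xrange(1000):
    data = np.load('tmp.npz')
    X = data['X']        # read the array while the file is still open
    total += X.sum()
    del data.f           # break the NpzFile <-> BagObj reference cycle
    data.close()         # avoid the "too many files are open" error
print total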
To find this kind of memory leak, you can use the following code:
import numpy as np
import gc

# here comes the overflow:
for i in xrange(10000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

d = dict()
for o in gc.get_objects():
    name = type(o).__name__
    if name not in d:
        d[name] = 1
    else:
        d[name] += 1

items = d.items()
items.sort(key=lambda x: x[1])
for key, value in items:
    print key, value
After running the test program, I created a dict and counted the objects returned by gc.get_objects(). Here is the output:
...
wrapper_descriptor 1382
function 2330
tuple 9117
BagObj 10000
NpzFile 10000
list 20288
dict 21001
From the result we know that something is wrong with BagObj and NpzFile. Looking at the numpy code:
class NpzFile(object):
    def __init__(self, fid, own_fid=False):
        ...
        self.zip = _zip
        self.f = BagObj(self)
        if own_fid:
            self.fid = fid
        else:
            self.fid = None

    def close(self):
        """
        Close the file.
        """
        if self.zip is not None:
            self.zip.close()
            self.zip = None
        if self.fid is not None:
            self.fid.close()
            self.fid = None

    def __del__(self):
        self.close()


class BagObj(object):
    def __init__(self, obj):
        self._obj = obj

    def __getattribute__(self, key):
        try:
            return object.__getattribute__(self, '_obj')[key]
        except KeyError:
            raise AttributeError, key
NpzFile has __del__(), NpzFile.f is a BagObj, and BagObj._obj is the NpzFile: this is a reference cycle, and it makes both NpzFile and BagObj uncollectable. There is some explanation of this in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage
So, to break the reference cycle, you need to call del data.f.
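A quick way to confirm the diagnosis (a sketch, assuming Python 2 and an affected numpy version): on Python 2, a cycle that contains an object with __del__ is never freed by the collector; it is parked in gc.garbage instead, where NpzFile and BagObj should show up:

import gc
import numpy as np

data = np.load('tmp.npz')
data.close()
del data        # drop our reference; the NpzFile <-> BagObj cycle remains
gc.collect()
print [type(o).__name__ for o in gc.garbage]   # expect NpzFile and BagObj here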