memory overflow when using numpy load in a loop


Problem description

Looping over npz file loads causes a memory overflow (depending on the length of the file list).

None of the following seems to help:

  1. Deleting the variable which stores the data in the file.
  2. Using mmap.
  3. Calling gc.collect() (garbage collection).

The following code should reproduce the phenomenon:

import numpy as np

# generate a file for the demo
X = np.random.randn(1000, 1000)
np.savez('tmp.npz', X=X)

# here comes the overflow:
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

In my real application the loop is over a list of files, and the overflow exceeds 24 GB of RAM! Please note that this was tried on Ubuntu 11.10, with both numpy v1.5.1 and v1.6.0.
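For what it's worth, in modern NumPy the reference cycle discussed below was fixed long ago and NpzFile is a context manager, so a Python 3 version of this loop can be written as follows. This is a sketch, not the original code; the filename tmp_demo.npz and the small array size are just for the demo:

```python
import numpy as np

# generate a small file for the demo (hypothetical filename)
np.savez('tmp_demo.npz', X=np.random.randn(10, 10))

# the with-statement closes the archive on every iteration,
# even if an exception occurs while reading
for i in range(1000):
    with np.load('tmp_demo.npz') as data:
        X = data['X']
```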

I have filed a report in numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure that this is a bug; it may be the result of my bad programming).

The command

del data.f

should precede the command

data.close()

For more information and a method to find the solution, please read HYRY's kind answer below.

Answer

I think this is a bug, and maybe I found the solution: call "del data.f".

for i in xrange(10000000):
    data = np.load('tmp.npz')
    del data.f
    data.close()  # avoid the "too many files are open" error

To find this kind of memory leak, you can use the following code:

import numpy as np
import gc
# here comes the overflow:
for i in xrange(10000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error

d = dict()
for o in gc.get_objects():
    name = type(o).__name__
    if name not in d:
        d[name] = 1
    else:
        d[name] += 1

items = d.items()
items.sort(key=lambda x:x[1])
for key, value in items:
    print key, value
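On Python 3 the same object-counting idiom can be written more compactly with collections.Counter. This is a sketch, not HYRY's original code; the function name count_live_objects is hypothetical:

```python
import gc
from collections import Counter

def count_live_objects(top=5):
    """Count the objects tracked by the garbage collector, grouped by type name."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(top)

# print the most common tracked types, largest counts first
for name, n in count_live_objects():
    print(name, n)
```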

After the test program, I created a dict and counted the objects in gc.get_objects(). Here is the output:

...
wrapper_descriptor 1382
function 2330
tuple 9117
BagObj 10000
NpzFile 10000
list 20288
dict 21001

From the result we know that something is wrong with BagObj and NpzFile. Find the code:

class NpzFile(object):
    def __init__(self, fid, own_fid=False):
        ...
        self.zip = _zip
        self.f = BagObj(self)
        if own_fid:
            self.fid = fid
        else:
            self.fid = None

    def close(self):
        """
        Close the file.

        """
        if self.zip is not None:
            self.zip.close()
            self.zip = None
        if self.fid is not None:
            self.fid.close()
            self.fid = None

    def __del__(self):
        self.close()

class BagObj(object):
    def __init__(self, obj):
        self._obj = obj
    def __getattribute__(self, key):
        try:
            return object.__getattribute__(self, '_obj')[key]
        except KeyError:
            raise AttributeError, key

NpzFile has __del__(), NpzFile.f is a BagObj, and BagObj._obj is the NpzFile: this is a reference cycle, and it will make both NpzFile and BagObj uncollectable. Here is some explanation in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage
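The cycle can be demonstrated with a minimal pair of stand-in classes (NpzLike and BagLike are hypothetical names, not numpy code). Note that on the Python 2 of this question's era, a cycle involving __del__ ended up in gc.garbage and was never freed; since Python 3.4 (PEP 442) the collector can reclaim such cycles, so this Python 3 sketch shows the cycle keeping the objects alive, rather than the permanent leak:

```python
import gc
import weakref

class BagLike:
    """Stand-in for BagObj: holds a reference back to its owner."""
    def __init__(self, obj):
        self._obj = obj          # child -> parent: closes the cycle

class NpzLike:
    """Stand-in for NpzFile: has a finalizer and points to a BagLike."""
    def __init__(self):
        self.f = BagLike(self)   # parent -> child
    def __del__(self):
        pass                     # a finalizer, as in NpzFile

gc.disable()                     # rely on reference counting only
ref = weakref.ref(NpzLike())     # no strong reference survives this statement
leaked = ref() is not None       # True: the cycle keeps both objects alive
gc.enable()
gc.collect()                     # the cycle detector reclaims them (Python 3.4+)
freed = ref() is None
```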

So, to break the reference cycle, one needs to call "del data.f".
