Python double free error for huge datasets


Problem description

I have a very simple script in Python, but for some reason I get the following error when running a large amount of data:

*** glibc detected *** python: double free or corruption (out): 0x00002af5a00cc010 ***

I am used to these errors coming up in C or C++, when one tries to free memory that has already been freed. However, by my understanding of Python (and especially the way I've written the code), I really don't understand why this should happen.

Here is the code:

#!/usr/bin/python -tt                                                                                                                                                                                                                         

import sys, commands, string
import numpy as np
import scipy.io as io
from time import clock

W = io.loadmat(sys.argv[1])['W']
size = W.shape[0]
numlabels = int(sys.argv[2])
Q = np.zeros((size, numlabels), dtype=np.double)
P = np.zeros((size, numlabels), dtype=np.double)
Q += 1.0 / Q.shape[1]
nu = 0.001
mu = 0.01
start = clock()
mat = -nu + mu*(W*(np.log(Q)-1))
end = clock()
print >> sys.stderr, "Time taken to compute matrix: %.2f seconds"%(end-start)

One may ask, why declare a P and a Q numpy array? I simply do that to reflect the actual conditions (as this code is simply a segment of what I actually do, where I need a P matrix and declare it beforehand).

I have access to a 192GB machine, and so I tested this out on a very large SciPy sparse matrix (2.2 million by 2.2 million, but very sparse, that's not the issue). The main memory is taken up by the Q, P, and mat matrices, as they are all 2.2 million by 2000 matrices (size = 2.2 million, numlabels = 2000). The peak memory goes up to 131GB, which comfortably fits in memory. While the mat matrix is being computed, I get the glibc error, and my process automatically goes into the sleep (S) state, without deallocating the 131GB it has taken up.
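
As a quick sanity check on that peak figure, here is a back-of-the-envelope estimate only, assuming plain float64 storage for Q, P, and mat plus one dense temporary live at the peak:

# Rough memory footprint of the dense arrays (illustrative estimate).
rows, cols = 2200000, 2000            # size, numlabels
gib = rows * cols * 8 / float(2**30)  # one float64 matrix, in GiB
print("one dense float64 matrix: %.1f GiB" % gib)        # ~32.8 GiB
print("Q + P + mat:              %.1f GiB" % (3 * gib))  # ~98.3 GiB
print("plus one temporary:       %.1f GiB" % (4 * gib))  # ~131.1 GiB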

Given the bizarre (for Python) error (I am not explicitly deallocating anything), and the fact that this works nicely for smaller matrix sizes (around 1.5 million by 2000), I am really not sure where to start to debug this.

As a starting point, I have set "ulimit -s unlimited" before running, but to no avail.

Any help or insight into numpy's behavior with really large amounts of data would be welcome.

Note that this is NOT an out of memory error - I have 196GB, and my process reaches around 131GB and stays there for some time before giving the error below.

Update: February 16, 2013 (1:10 PM PST):

As per suggestions, I ran Python with GDB. Interestingly, on one GDB run I forgot to set the stack size limit to "unlimited", and got the following output:

*** glibc detected *** /usr/bin/python: munmap_chunk(): invalid pointer: 0x00007fe7508a9010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733b6)[0x7ffff6ec23b6]
/usr/lib64/python2.7/site-packages/numpy/core/multiarray.so(+0x4a496)[0x7ffff69fc496]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4e67)[0x7ffff7af48c7]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x309)[0x7ffff7af6c49]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7ffff7b25592]
/usr/lib64/libpython2.7.so.1.0(+0xfcc61)[0x7ffff7b33c61]
/usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x84)[0x7ffff7b34074]
/usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x189)[0x7ffff7b347c9]
/usr/lib64/libpython2.7.so.1.0(Py_Main+0x36c)[0x7ffff7b3e1bc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffff6e6dbfd]
/usr/bin/python[0x4006e9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 09:01 50336181                           /usr/bin/python2.7
00600000-00601000 r--p 00000000 09:01 50336181                           /usr/bin/python2.7
00601000-00602000 rw-p 00001000 09:01 50336181                           /usr/bin/python2.7
00602000-00e5f000 rw-p 00000000 00:00 0                                  [heap]
7fdf2584c000-7ffff0a66000 rw-p 00000000 00:00 0 
7ffff0a66000-7ffff0a6b000 r-xp 00000000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0a6b000-7ffff0c6a000 ---p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6a000-7ffff0c6b000 r--p 00004000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6b000-7ffff0c6c000 rw-p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6c000-7ffff0c77000 r-xp 00000000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0c77000-7ffff0e76000 ---p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e76000-7ffff0e77000 r--p 0000a000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e77000-7ffff0e78000 rw-p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e78000-7ffff0e79000 rw-p 00000000 00:00 0 
7ffff0e79000-7ffff0e9b000 r-xp 00000000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff0e9b000-7ffff109a000 ---p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109a000-7ffff109b000 r--p 00021000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109b000-7ffff109f000 rw-p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109f000-7ffff10a0000 rw-p 00000000 00:00 0 
7ffff10a0000-7ffff10a5000 r-xp 00000000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff10a5000-7ffff12a4000 ---p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a4000-7ffff12a5000 r--p 00004000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a5000-7ffff12a7000 rw-p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a7000-7ffff12ad000 r-xp 00000000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff12ad000-7ffff14ac000 ---p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ac000-7ffff14ad000 r--p 00005000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ad000-7ffff14ae000 rw-p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ae000-7ffff14b5000 r-xp 00000000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff14b5000-7ffff16b4000 ---p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b4000-7ffff16b5000 r--p 00006000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b5000-7ffff16b6000 rw-p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b6000-7ffff17c2000 r-xp 00000000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff17c2000-7ffff19c2000 ---p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c2000-7ffff19c3000 r--p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c3000-7ffff19c6000 rw-p 0010d000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c6000-7ffff19d5000 r-xp 00000000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff19d5000-7ffff1bd4000 ---p 0000f000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff1bd4000-7ffff1bd5000 r--p 0000e000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
Program received signal SIGABRT, Aborted.
0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
#1  0x00007ffff6e82fb6 in abort () from /lib64/libc.so.6
#2  0x00007ffff6ebcdd3 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ec23b6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69fc496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#5  0x00007ffff7af48c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#6  0x00007ffff7af6c49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#7  0x00007ffff7b25592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#8  0x00007ffff7b33c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#9  0x00007ffff7b34074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#10 0x00007ffff7b347c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007ffff7b3e1bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007ffff6e6dbfd in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004006e9 in _start ()

When I set the stack size limit to "unlimited", I get the following:

*** glibc detected *** /usr/bin/python: double free or corruption (out): 0x00002abb2732c010 ***
^X^C
Program received signal SIGINT, Interrupt.
0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab969f2e in _L_lock_9927 () from /lib64/libc.so.6
#2  0x00002aaaab9682d1 in free () from /lib64/libc.so.6
#3  0x00002aaaaaabbfe2 in _dl_scope_free () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaab70a4 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#5  0x00002aaaaaabcaa0 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00002aaaaaabc5da in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00002aaaab9fb530 in do_dlopen () from /lib64/libc.so.6
#9  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00002aaaab9fb5cf in dlerror_run () from /lib64/libc.so.6
#11 0x00002aaaab9fb637 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00002aaaab9d60c5 in init () from /lib64/libc.so.6
#13 0x00002aaaab080933 in pthread_once () from /lib64/libpthread.so.0
#14 0x00002aaaab9d61bc in backtrace () from /lib64/libc.so.6
#15 0x00002aaaab95dde7 in __libc_message () from /lib64/libc.so.6
#16 0x00002aaaab9633b6 in malloc_printerr () from /lib64/libc.so.6
#17 0x00002aaaab9682dc in free () from /lib64/libc.so.6
#18 0x00002aaaabef1496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#19 0x00002aaaaad888c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#20 0x00002aaaaad8ac49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#21 0x00002aaaaadb9592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#22 0x00002aaaaadc7c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#23 0x00002aaaaadc8074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#24 0x00002aaaaadc87c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#25 0x00002aaaaadd21bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#26 0x00002aaaab90ebfd in __libc_start_main () from /lib64/libc.so.6
#27 0x00000000004006e9 in _start ()

This makes me believe the basic issue is with the numpy multiarray core module (line #4 in the first output and line #18 in the second). I will bring it up as a bug report in both numpy and scipy just in case.

Has anyone seen this before?

Update: February 17, 2013 (4:45 PM PST)

I found a machine that I could run the code on that had a more recent version of SciPy (0.11) and NumPy (1.7.0). Running the code straight up (without GDB) resulted in a seg fault without any output to stdout or stderr. Running again through GDB, I get the following:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaac5fcd04 in PyDataMem_FREE (ptr=<optimized out>, $K8=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:3510
#2  array_dealloc (self=0xc00ab7edbfc228fe) at numpy/core/src/multiarray/arrayobject.c:416
#3  0x0000000000498eac in PyEval_EvalFrameEx ()
#4  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#5  0x00000000004a9081 in PyRun_FileExFlags ()
#6  0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#7  0x00000000004aa8bd in Py_Main ()
#8  0x00002aaaabe4f76d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041b9b1 in _start ()

I understand this is not as useful as a NumPy compiled with debugging symbols; I will try doing that and post the output later.

Answer

After discussions on the same issue on the NumPy GitHub page (https://github.com/numpy/numpy/issues/2995), it has been brought to my attention that NumPy/SciPy will not support such a large number of non-zeros in the resulting sparse matrix.

Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense). However, note that since I have no zero rows in my W matrix, the resulting product W*(np.log(Q)-1) will have nnz > 2^31 (2.2 million multiplied by 2000) and this exceeds the maximum number of elements in a sparse matrix in current versions of Scipy.
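
Spelled out, the arithmetic behind that limit (simply an illustration of the counts above):

# nnz of the product versus the signed 32-bit index limit (illustration).
size, numlabels = 2200000, 2000
nnz = size * numlabels       # no zero rows in W, so every entry can be non-zero
limit = 2**31 - 1            # largest value a signed 32-bit index can hold
print(nnz)                   # 4400000000
print(limit)                 # 2147483647
print(nnz > limit)           # True -> the sparse index type overflows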

At this stage, I'm not sure how else to get this to work, barring a re-implementation in another language. Perhaps it can still be done in Python, but it might be better to just write up a C++ and Eigen implementation.
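
If it does stay in Python, one possible direction is sketched below. This is only a rough, untested idea, assuming W is a scipy.sparse matrix (e.g. CSR) and Q a dense float64 array: compute the product a couple of hundred columns of Q at a time, so that no single partial product comes anywhere near the 2^31-element limit.

import numpy as np

def compute_mat_blockwise(W, Q, nu, mu, block=200):
    # Compute -nu + mu * (W.dot(np.log(Q) - 1)) over column blocks of Q.
    # Each partial product has at most Q.shape[0] * block entries, which
    # stays far below the 2^31 limit discussed above.
    n, k = Q.shape
    mat = np.empty((n, k), dtype=np.double)
    for start in range(0, k, block):
        stop = min(start + block, k)
        part = W.dot(np.log(Q[:, start:stop]) - 1.0)      # product for these columns
        mat[:, start:stop] = -nu + mu * np.asarray(part)  # normalize matrix -> ndarray
    return mat

# Hypothetical usage with the variables from the script above:
# mat = compute_mat_blockwise(W, Q, nu=0.001, mu=0.01)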

A special thanks to pv. for helping out on this to pinpoint the exact issue, and thanks to everyone else for the brainstorming!
