Python 3 memory error for large loop with large sparse matrix


Question

I am working on code that builds a large, relatively sparse matrix and uses it to solve a least-squares minimization problem. However, I keep getting memory errors when I run it, even though the matrix does not seem large enough to strain my system (roughly 27000 by 2100 in my test cases).

I have written a simplified version that has the same storage requirements as my test case and also raises a memory error (note that the "sparse" matrix is not actually very sparse, because I am testing with a smaller-scale problem than the intended dataset will involve):

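import numpy as np
from scipy import sparse

# LIL matrix; one column is filled per loop iteration
BM = sparse.lil_matrix((27000, 3000))
for i in range(0, 3000):
    local_mat = np.random.rand(30,30,30)  # 27000 random values in [0, 1)
    local_mat[local_mat<0.1] = 0          # zero out roughly 10% of them
    vals = local_mat.ravel()
    nonzero = vals.nonzero()
    BM[nonzero, i] = vals[nonzero]        # store the nonzeros in column i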

If I change the parameters so that the sparse matrix has more zero entries, I still get a memory error from scipy.sparse.linalg.lsq_linear after filling the rows of the matrix and running a minimization problem with it.

It goes without saying that I get a memory error if I use a dense matrix as well.

I have tried increasing the size of my paging file to 2-4 GB, but that hasn't helped, and this does not seem like it should be that memory-intensive anyway.

Recommended answer

You've created a memory bonfire. Let's recall how a lil_matrix works:

Each row of the matrix is kept as a Python list. There is one list for the data and another for the column indices. Each non-zero element has one entry in each list.
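To make that layout concrete, here is a minimal sketch on a tiny, made-up 3 x 4 matrix (the sizes and values are purely illustrative). It reads the `rows` and `data` attributes of a `lil_matrix`, which hold one Python list per row:

```python
from scipy import sparse

# Tiny hypothetical matrix, used only to inspect the internal layout.
M = sparse.lil_matrix((3, 4))
M[0, 1] = 2.5
M[0, 3] = 1.0
M[2, 0] = 7.0

# M.rows is an object array holding one Python list of column indices per row.
rows_as_lists = [[int(j) for j in r] for r in M.rows]
# M.data is an object array holding the matching Python list of values per row.
data_as_lists = [[float(v) for v in d] for d in M.data]

print(rows_as_lists)  # [[1, 3], [], [0]]
print(data_as_lists)  # [[2.5, 1.0], [], [7.0]]
```

Every stored element therefore costs one list slot and one Python object in each of the two lists, which is where the per-entry overhead comes from.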

Taking your example 27000 x 3000 matrix, let's add this up. Each entry in a Python list is an object that burns 16 bytes of overhead. For the data list, the float value itself is another 8 bytes. So the non-zero values alone eat up about 2 GB. Now look at the indices: another set of lists, each entry of which burns 16 bytes of overhead. Poof, there's another 1.5 GB gone. The lists themselves and the array that holds them take a bit more memory, but that's not much compared to what's already gone.
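The arithmetic can be redone as a rough back-of-the-envelope sketch. It takes the per-object figures quoted above at face value (16 bytes of overhead, 8 bytes of float payload) and assumes about 90% of the entries are non-zero, as in the example; real per-object sizes depend on the Python build and may well be higher, so treat the results as order-of-magnitude only:

```python
n_rows, n_cols = 27000, 3000

# Assumption: about 90% of entries survive the `< 0.1` thresholding.
nnz = n_rows * n_cols * 9 // 10

OVERHEAD = 16       # bytes per Python object, the rough figure quoted above
FLOAT_PAYLOAD = 8   # bytes for the float value itself

data_bytes = nnz * (OVERHEAD + FLOAT_PAYLOAD)   # value lists
index_bytes = nnz * OVERHEAD                    # index lists
dense_bytes = n_rows * n_cols * 8               # plain float64 array

for label, b in [("data", data_bytes), ("indices", index_bytes), ("dense", dense_bytes)]:
    print(f"{label:8s} {b / 1e9:.2f} GB")
```

With these assumptions the two list structures come to roughly 1.75 GB and 1.17 GB, the same ballpark as the rounded 2 GB and 1.5 GB above, against 0.65 GB for a dense array.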

Compared to a dense 27000 x 3000 array, which takes about 650 MB, you've managed to use roughly 5x as much memory, while also acquiring several major disadvantages compared to a NumPy array. A standard sparse format like CSR would be a much better choice: it would use only about 900 MB for your 90% dense example.
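The CSR figure can be sanity-checked on a scaled-down stand-in. The sketch below uses a 2700 x 300 matrix (an arbitrary smaller size chosen so it runs quickly) with the same roughly 90% density, and sums the three arrays a `csr_matrix` actually stores:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Scaled-down stand-in for the 27000 x 3000 example, about 90% non-zero.
dense = rng.random((2700, 300))
dense[dense < 0.1] = 0.0

csr = sparse.csr_matrix(dense)

# CSR keeps three flat NumPy arrays: values, column indices, row pointers.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = dense.nbytes
ratio = csr_bytes / dense_bytes

print(f"CSR uses {ratio:.2f}x the memory of the dense array")
```

At this density CSR is somewhat larger than the dense array, consistent with the 900 MB versus 650 MB comparison; it only starts saving memory once the matrix is genuinely sparse.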

You're also calling a function that absolutely requires a CSR matrix, so it converts the horrible lil_matrix you've created into a csr_matrix, forcing you to keep two full copies of the data in memory.
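One way to sidestep the lil_matrix entirely is to assemble the matrix column by column in a compressed format. This is a sketch under assumptions, not something the answer prescribes: it builds the raw CSC arrays by hand on a scaled-down 2700 x 30 problem, with random columns standing in for the real data, and converts to CSR once at the end:

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 2700, 30   # scaled-down stand-in for 27000 x 3000
rng = np.random.default_rng(0)

data_chunks, index_chunks = [], []
indptr = np.zeros(n_cols + 1, dtype=np.int32)

for j in range(n_cols):
    vals = rng.random(n_rows)            # placeholder for the real column
    vals[vals < 0.1] = 0.0
    rows = np.flatnonzero(vals)          # sorted row indices of the non-zeros
    index_chunks.append(rows.astype(np.int32))
    data_chunks.append(vals[rows])
    indptr[j + 1] = indptr[j] + rows.size   # where column j ends

# CSC stores columns contiguously, which matches a column-wise fill.
BM = sparse.csc_matrix(
    (np.concatenate(data_chunks), np.concatenate(index_chunks), indptr),
    shape=(n_rows, n_cols),
)

# Convert once if the downstream solver wants CSR.
BM_csr = BM.tocsr()
```

The conversion still holds two copies briefly, but both are compact NumPy-backed structures, and the CSC copy can be dropped with `del BM` as soon as the CSR one exists.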
