Error Converting Sparse Matrix to Array with scipy.sparse.csc_matrix.toarray()

Problem description

I have a scipy.sparse.csc_matrix that I am trying to convert to an array with scipy.sparse.csc_matrix.toarray(). On a small dataset the call works fine. On a large dataset, however, the Python interpreter crashes as soon as the function is called, and the window closes without an error message. The matrix I am trying to convert was created with sklearn.feature_extraction.text.CountVectorizer. I am running Python 2.7.3 on Ubuntu 12.04. To complicate matters, when I run the script from the terminal in order to capture any error message, the log records no error and in fact stops much earlier in the script (although the log is complete if toarray() is not called).

Recommended answer

You cannot call toarray on a large sparse matrix, because it will try to store all the values (including the zeros) explicitly in one contiguous chunk of memory.

Let's take an example. Assume you have a sparse matrix A:

>>> A.shape
(10000, 100000)
>>> A.nnz              # non zero entries
47231
>>> A.dtype.itemsize
8

The size of the non-zero data in MB is:

>>> (A.nnz * A.dtype.itemsize) / 1e6
0.377848

You can check that this matches the size of the data array of the sparse matrix data-structure:

>>> A.data.nbytes / 1e6
0.377848

Depending on the kind of sparse matrix data structure (CSR, CSC, COO, ...), it also stores the locations of the non-zero entries in various ways. In general this approximately doubles the memory usage, so the total memory used by A is on the order of 700 kB.
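As a rough sketch of that accounting, the snippet below builds a stand-in for A and adds up the three arrays that a CSC (or CSR) matrix keeps: data, indices and indptr. The random construction, the seed and the helper name sparse_nbytes are assumptions made for illustration, not part of the original question. Because duplicate coordinates are merged during conversion, nnz may come out slightly below the requested count.

```python
import numpy as np
import scipy.sparse as sp


def sparse_nbytes(m):
    """Bytes held by the three arrays of a CSR/CSC matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Hypothetical stand-in for A: random coordinates, same shape as the example.
rng = np.random.RandomState(0)
n_rows, n_cols, n_entries = 10000, 100000, 47231
rows = rng.randint(0, n_rows, size=n_entries)
cols = rng.randint(0, n_cols, size=n_entries)
vals = np.ones(n_entries, dtype=np.float64)

# Duplicate (row, col) pairs are summed when converting COO -> CSC.
A = sp.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols)).tocsc()

print(A.nnz)                    # stored non-zero entries
print(A.data.nbytes / 1e6)      # values only, in MB
print(A.indices.nbytes / 1e6)   # row index of each stored value
print(A.indptr.nbytes / 1e6)    # one offset per column, plus one
print(sparse_nbytes(A) / 1e6)   # total for the sparse representation
```

Note that for CSC the indptr array has one entry per column plus one, so with 100000 columns it contributes a noticeable share of the total on its own. The "roughly double" rule is only an order-of-magnitude estimate.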

Converting to the contiguous array representation would materialize all the zeros in memory and the resulting size would be:

>>> A.shape[0] * A.shape[1] * A.dtype.itemsize / 1e6
8000.0

That's 8GB for this example, compared to less than 1MB for the original sparse representation.
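One way to avoid a silent crash is to estimate the dense size before converting. The following is a minimal sketch, not a SciPy feature: the helper names dense_nbytes and safe_toarray and the 1 GB default limit are assumptions chosen for illustration.

```python
import numpy as np
import scipy.sparse as sp


def dense_nbytes(m):
    """Bytes the dense array produced by m.toarray() would occupy."""
    return m.shape[0] * m.shape[1] * m.dtype.itemsize


def safe_toarray(m, max_bytes=10**9):
    """Call toarray() only if the dense result fits under max_bytes.

    Hypothetical guard; the 1 GB default is an arbitrary assumption.
    """
    needed = dense_nbytes(m)
    if needed > max_bytes:
        raise MemoryError(
            "dense array would need {:.1f} MB (limit {:.1f} MB)".format(
                needed / 1e6, max_bytes / 1e6))
    return m.toarray()


# A small matrix converts without trouble.
small = sp.eye(3, format="csc")
print(safe_toarray(small))

# An empty matrix with the shape from the example is refused up front.
big = sp.csc_matrix((10000, 100000), dtype=np.float64)
try:
    safe_toarray(big)
except MemoryError as exc:
    print("refused:", exc)
```

If the guard trips, it is usually better to keep the data sparse. Many scikit-learn estimators accept the sparse output of CountVectorizer directly, so the dense conversion is often not needed at all.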
