Using a sparse matrix versus numpy array
Question
I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:
Wikipedia says:
a sparse matrix is a matrix in which most of the elements are zero
Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?
Any help is much appreciated!
Recommended answer
The scipy sparse matrix package, and similar ones in MATLAB, was based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like the matrix product (the dot product for numpy arrays) and equation solvers are well developed.
My rough experience is that a sparse csr matrix product has to have a 1% sparsity to be faster than the equivalent dense dot operation - in other words, one nonzero value for every 99 zeros. (But see the tests below.)
People also try to use sparse matrices to save memory. Keep in mind, though, that such a matrix has to store 3 arrays of values (at least in the coo format): one entry in each array per nonzero element. So the sparsity (the fraction of nonzero values) has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array and then create the sparse one from that.
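The memory argument above can be checked directly. The following is a minimal sketch, assuming float64 values and whatever index dtype scipy picks by default; the helper name coo_nbytes and the matrix size are only for illustration. The exact break-even point depends on the value and index dtypes, so treat the 1/3 figure as a rule of thumb.

```python
import numpy as np
from scipy import sparse


def coo_nbytes(m):
    """Bytes held by the three 1d arrays of a coo matrix (hypothetical helper)."""
    return m.data.nbytes + m.row.nbytes + m.col.nbytes


shape = (1000, 1000)
dense_bytes = np.zeros(shape).nbytes  # float64: 8 bytes per element

# Compare storage at several fractions of nonzero values.
for density in (0.01, 0.1, 0.3, 0.5):
    m = sparse.random(*shape, density=density, format="coo")
    print(f"density={density:4.2f}  coo bytes={coo_nbytes(m):>9}  dense bytes={dense_bytes}")
```

Note that the sparse matrix here is generated directly in coo format; building the dense array first and converting it would still allocate the full dense storage.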
The scipy package implements many sparse formats. The coo format is the easiest to understand and build. Build one according to the documentation and look at its .data, .row, and .col attributes (three 1d arrays).
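As a small illustration of the coo format, the sketch below builds a 3x3 matrix from made-up values and prints the three 1d arrays. The numbers are arbitrary and chosen only to keep the output readable.

```python
import numpy as np
from scipy import sparse

# Nonzero values and their (row, col) coordinates; illustrative data only.
data = np.array([4, 5, 7, 9])
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 2])

M = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

print(M.data)       # the stored nonzero values
print(M.row)        # row index of each stored value
print(M.col)        # column index of each stored value
print(M.toarray())  # dense view, for comparison
```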
csr and csc are typically built from the coo format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.
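To see what the compression looks like, here is a minimal sketch that converts a small coo matrix (arbitrary example values) to csr and prints the arrays the csr format stores. Instead of one row index per value, csr keeps a row pointer array, indptr, with one entry per row plus one.

```python
import numpy as np
from scipy import sparse

# Illustrative data only.
data = np.array([4, 5, 7, 9])
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 2])
coo = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

csr = coo.tocsr()

print(csr.data)     # nonzero values, stored row by row
print(csr.indices)  # column index of each stored value
print(csr.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
```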
It is also possible to index the csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, and incremental growth are also slower.
lil (lists of lists) is also easy to understand, and best for incremental building. dok is actually a dictionary subclass.
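Incremental building with lil might look like the sketch below: fill entries one at a time, then convert to csr for the math. The shape and values are made up for illustration.

```python
from scipy import sparse

# Start empty and assign individual entries; lil is designed for this.
L = sparse.lil_matrix((3, 4))
L[0, 1] = 2.0
L[2, 3] = 5.0

print(L.rows)  # per-row lists of column indices
print(L.data)  # per-row lists of values

# Convert once building is done; csr has most of the math functionality.
C = L.tocsr()
print(C.toarray())
```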
A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).
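One place where the np.matrix-like behavior shows up is the * operator and row indexing. A minimal sketch, using csr_matrix and a small made-up array:

```python
import numpy as np
from scipy import sparse

D = np.array([[1, 2], [0, 3]])
A = sparse.csr_matrix(D)

print(D * D)              # ndarray: elementwise product
print((A * A).toarray())  # csr_matrix: matrix product, like np.matrix
print(D.dot(D))           # dense matrix product, for comparison

# Indexing a single row keeps the result 2d.
print(A[0].shape)
```

This is why the timing tests in this answer can write M*M for the sparse product but need Ma.dot(Ma) for the dense one.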
A search for other questions tagged scikit-learn and sparse might be the best way of finding the pros and cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.
I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.
In [251]: M=sparse.random(1000,1000,.5)
In [252]: timeit M1=M*M
1 loops, best of 3: 2.78 s per loop
In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1 loops, best of 3: 4.28 s per loop
It is a size issue; for smaller matrices the dense dot is faster:
In [255]: M=sparse.random(100,100,.5)
In [256]: timeit M1=M*M
100 loops, best of 3: 3.24 ms per loop
In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1000 loops, best of 3: 1.44 ms per loop
But compare indexing:
In [268]: timeit M.tocsr()[500,500]
10 loops, best of 3: 86.4 ms per loop
In [269]: timeit Ma[500,500]
1000000 loops, best of 3: 318 ns per loop
In [270]: timeit Ma=M.toarray();Ma[500,500]
10 loops, best of 3: 23.6 ms per loop
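For readers who want to rerun the product comparison outside IPython, here is a rough standalone sketch using the standard timeit module. The function name compare_products, the sizes, and the repeat count are assumptions for illustration; unlike the session above, the matrix is created in csr format up front, so conversion cost is not included. Timings will vary with machine, BLAS build, and scipy version.

```python
import timeit

import numpy as np
from scipy import sparse


def compare_products(n, density, repeats=3):
    """Return (sparse_seconds, dense_seconds) for an n x n product (hypothetical helper)."""
    M = sparse.random(n, n, density=density, format="csr")
    Ma = M.toarray()
    # Take the best of several single runs, as timeit's "best of 3" does.
    t_sparse = min(timeit.repeat(lambda: M @ M, number=1, repeat=repeats))
    t_dense = min(timeit.repeat(lambda: Ma.dot(Ma), number=1, repeat=repeats))
    return t_sparse, t_dense


for n, density in [(100, 0.5), (500, 0.5), (500, 0.01)]:
    t_sparse, t_dense = compare_products(n, density)
    print(f"n={n:4d} density={density:4.2f}  sparse={t_sparse:.5f}s  dense={t_dense:.5f}s")
```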