Using a sparse matrix versus numpy array
Question
I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. However, when I fed a numpy array versus a sparse matrix into the scikit-learn logistic regression classifier, it did not seem to make much of a difference. So I was wondering about three things:
- Wikipedia says:
  "a sparse matrix is a matrix in which most of the elements are zero"
  Is that an appropriate way to determine when to use a sparse matrix format, i.e. as soon as > 50 % of the values are zero? Or does it make sense to use one just in case?
Thanks a lot for your help!
Recommended answer
The scipy sparse matrix package, and similar ones in MATLAB, were based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like the matrix product (the dot product for numpy arrays) and equation solvers are well developed.
My rough experience is that a sparse csr matrix product has to have about 1% sparsity to be faster than the equivalent dense dot operation; in other words, one nonzero value for every 99 zeros. (But see the tests below.)
People also try to use sparse matrices to save memory. But keep in mind that such a matrix has to store 3 arrays (at least in the coo format): the nonzero values plus their row and column indices. So the fraction of nonzero values has to be less than about 1/3 to start saving memory, assuming an index takes as much space as a value. Obviously you aren't going to save memory if you first build the dense array and create the sparse one from that.
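The memory argument above can be checked directly. This is a minimal sketch, assuming float64 values and whatever index dtype scipy picks for the shape; the helper coo_nbytes is a name made up for this illustration. The exact break-even point depends on the ratio of index size to value size, so treat the 1/3 figure as a rule of thumb.

```python
import numpy as np
from scipy import sparse


def coo_nbytes(m):
    """Bytes held by the three arrays of a coo matrix (hypothetical helper)."""
    return m.data.nbytes + m.row.nbytes + m.col.nbytes


shape = (1000, 1000)
# Memory a dense float64 array of this shape would need.
dense_nbytes = shape[0] * shape[1] * np.dtype(np.float64).itemsize

saves_memory = {}
for density in (0.5, 0.1, 0.01):
    m = sparse.random(*shape, density=density, format="coo",
                      dtype=np.float64, random_state=0)
    saves_memory[density] = coo_nbytes(m) < dense_nbytes
    print(f"density={density}: coo={coo_nbytes(m)} bytes, "
          f"dense={dense_nbytes} bytes")
```

Note that the matrix here is generated in sparse form from the start; converting an existing dense array would require the dense memory first.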
The scipy package implements many sparse formats. The coo format is the easiest to understand and build. Build one according to the documentation and look at its .data, .row, and .col attributes (3 1d arrays).
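As an illustration tied to the word-count setting of the question, a coo matrix can be built straight from (row, column, count) triples without ever creating the dense array. The documents and the whitespace tokenization below are made up for this sketch.

```python
from collections import Counter

from scipy import sparse

# Toy documents (illustrative only).
docs = ["the cat sat", "the dog sat on the mat", "cat and dog"]

vocab = {}                       # word -> column index
rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        j = vocab.setdefault(word, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(count)

# Rows are documents, columns are word counts.
M = sparse.coo_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))

print(M.row)        # row index of each stored value
print(M.col)        # column index of each stored value
print(M.data)       # the stored counts
print(M.toarray())  # dense view, only sensible for small examples
```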
csr and csc are typically built from the coo format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.
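To see what the compression amounts to, the sketch below converts a tiny coo matrix to csr and prints the attributes of both. The matrix is built from a small dense array purely for readability. In csr, the per-value row array is replaced by a row pointer array, indptr, with one entry per row plus one.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 6]])

coo = sparse.coo_matrix(dense)
csr = coo.tocsr()

# coo: one (row, col, value) triple per nonzero.
print(coo.row, coo.col, coo.data)

# csr: the values of row i are data[indptr[i]:indptr[i + 1]],
# and their column indices are indices[indptr[i]:indptr[i + 1]].
print(csr.indptr, csr.indices, csr.data)
```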
It is also possible to index the csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations, like changing values (especially from 0 to nonzero), concatenation, and incremental growth, are also slower.
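One visible sign of the cost of changing a zero to a nonzero value is that scipy emits a SparseEfficiencyWarning when the sparsity structure of a csr matrix changes. A small sketch that captures it (the matrix size and the position are arbitrary):

```python
import warnings

from scipy import sparse

M = sparse.csr_matrix((1000, 1000))   # all zeros, nothing stored

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    M[5, 7] = 1.0                     # zero -> nonzero: structure changes

for w in caught:
    print(w.category.__name__, w.message)

print(M[5, 7], M.nnz)
```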
lil (lists of lists) is also easy to understand, and is best for incremental building. dok is actually a dictionary subclass.
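A minimal sketch of the incremental pattern: fill a lil matrix element by element, then convert once to csr for the math. The shape and values are arbitrary.

```python
from scipy import sparse

M = sparse.lil_matrix((4, 5))

# Assigning individual elements is what lil is meant for.
M[0, 1] = 2
M[2, 3] = 7
M[2, 0] = 1

# Each row holds a list of column indices and a list of values.
print(M.rows)
print(M.data)

# Convert once the structure is final.
csr = M.tocsr()
print(csr.nnz)
```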
A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).
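A few of the np.matrix-like behaviours, sketched with csr_matrix: the * operator is the matrix product, indexing a row keeps two dimensions, and reductions along an axis return a 2d result. This applies to the *_matrix classes used in this answer; the newer *_array classes in scipy follow ndarray conventions instead.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 2],
                                [0, 3]]))

product = A * A               # matrix product, as with np.matrix
elementwise = A.multiply(A)   # elementwise product needs a method call

print(product.toarray())
print(elementwise.toarray())

# Results stay 2d.
print(A[0].shape)             # a single row is still (1, 2)
print(A.sum(axis=1).shape)    # row sums come back as a (2, 1) matrix
```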
A search for other questions using scikit-learn and sparse might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.
I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.
In [251]: M=sparse.random(1000,1000,.5)
In [252]: timeit M1=M*M
1 loops, best of 3: 2.78 s per loop
In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1 loops, best of 3: 4.28 s per loop
It is a size issue; for smaller matrices the dense dot is faster:
In [255]: M=sparse.random(100,100,.5)
In [256]: timeit M1=M*M
100 loops, best of 3: 3.24 ms per loop
In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1000 loops, best of 3: 1.44 ms per loop
But compare indexing:
In [268]: timeit M.tocsr()[500,500]
10 loops, best of 3: 86.4 ms per loop
In [269]: timeit Ma[500,500]
1000000 loops, best of 3: 318 ns per loop
In [270]: timeit Ma=M.toarray();Ma[500,500]
10 loops, best of 3: 23.6 ms per loop
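Several of the timings above include a format conversion (M.toarray() or M.tocsr()) inside the timed statement. For anyone who wants to rerun the product comparison with the conversion kept outside the timed call, here is a rough standalone sketch using the standard timeit module. The function name compare and the sizes are choices made for this example, and the numbers will vary with machine, BLAS build, and scipy version.

```python
import timeit

import numpy as np
from scipy import sparse


def compare(n, density, repeat=3):
    """Time sparse vs dense matrix product for an n x n random matrix."""
    M = sparse.random(n, n, density=density, format="csr", random_state=0)
    Ma = M.toarray()               # conversion done once, outside the timing

    # Both products should agree numerically.
    assert np.allclose((M @ M).toarray(), Ma @ Ma)

    t_sparse = min(timeit.repeat(lambda: M @ M, number=1, repeat=repeat))
    t_dense = min(timeit.repeat(lambda: Ma @ Ma, number=1, repeat=repeat))
    return t_sparse, t_dense


for n, density in [(100, 0.5), (400, 0.01)]:
    t_sparse, t_dense = compare(n, density)
    print(f"n={n}, density={density}: "
          f"sparse {t_sparse * 1e3:.3f} ms, dense {t_dense * 1e3:.3f} ms")
```

Raising n to 1000 with density 0.5 reproduces the larger case from the session above, at the cost of a run time of several seconds.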