Using a sparse matrix versus numpy array

Question

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:

a sparse matrix is a matrix in which most of the elements are zero

Is that an appropriate way to determine when to use a sparse matrix format: as soon as more than 50% of the values are zero? Or does it make sense to use one just in case?

Many thanks for your help!

Recommended answer

The scipy sparse matrix package, and similar ones in MATLAB, were based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like the matrix product (the dot product for numpy arrays) and equation solvers are well developed.

My rough experience is that a sparse csr matrix product needs a sparsity of about 1% to be faster than the equivalent dense dot operation; in other words, one nonzero value for every 99 zeros. (But see the tests below.)

People also try to use sparse matrices to save memory. Keep in mind, though, that such a matrix has to store 3 arrays of values (at least in the coo format), so the sparsity has to be less than 1/3 before it starts saving memory. And obviously you aren't going to save memory if you first build the dense array and then create the sparse one from it.
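
The memory argument above can be checked directly. This is a rough sketch, not a benchmark: the helper coo_nbytes and the chosen shape and densities are my own for illustration. Note that the 1/3 figure assumes values and indices take the same number of bytes; if scipy stores smaller index types than value types, the break-even point moves, so it is worth measuring on your own data.

```python
import numpy as np
from scipy import sparse


def coo_nbytes(m):
    """Bytes used by the three 1d arrays of a coo matrix (hypothetical helper)."""
    return m.data.nbytes + m.row.nbytes + m.col.nbytes


shape = (1000, 1000)
for density in (0.5, 0.1, 0.01):
    # random_state is fixed only to make the run repeatable
    m = sparse.random(*shape, density=density, format="coo", random_state=0)
    dense = m.toarray()
    print("density", density,
          "dense bytes", dense.nbytes,
          "coo bytes", coo_nbytes(m))
```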

The scipy package implements many sparse formats. The coo format is easiest to understand and build. Build one according to documentation and look at its .data, .row, and .col attributes (3 1d arrays).
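
As a small illustration of the coo format, here is a sketch that builds a word-count matrix from (row, column, value) triples. The counts are made up for the example; only coo_matrix and its .data, .row and .col attributes come from scipy.

```python
import numpy as np
from scipy import sparse

# Hypothetical word counts: document index, word index, count
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 3, 1, 3])
counts = np.array([2, 1, 4, 3])

# 3 documents, 4 words; every entry not listed is an implicit zero
M = sparse.coo_matrix((counts, (rows, cols)), shape=(3, 4))

print(M.data)       # the stored nonzero values
print(M.row)        # row index of each stored value
print(M.col)        # column index of each stored value
print(M.toarray())  # dense view, for inspection only
```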

csr and csc are typically built from the coo format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.
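
To see what the compression looks like, the sketch below converts a small coo matrix (same made-up counts as an illustration) to csr and prints its three arrays. In csr the row indices are replaced by indptr, a pointer array with one entry per row plus one.

```python
import numpy as np
from scipy import sparse

rows = np.array([0, 0, 1, 2])
cols = np.array([0, 3, 1, 3])
counts = np.array([2, 1, 4, 3])
coo = sparse.coo_matrix((counts, (rows, cols)), shape=(3, 4))

csr = coo.tocsr()
print(csr.data)     # stored values, grouped by row
print(csr.indices)  # column index of each stored value
print(csr.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# csr supports the math operations, e.g. a matrix-vector product
row_totals = csr @ np.ones(4)
print(row_totals)
```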

It is also possible to index csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, incremental growth, are also slower.
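
scipy itself flags the cost of changing a zero to a nonzero in csr. The sketch below assumes that assigning into an empty csr_matrix raises a SparseEfficiencyWarning, which is the behavior I have seen; the exact message may differ between versions.

```python
import warnings

from scipy import sparse
from scipy.sparse import SparseEfficiencyWarning

M = sparse.csr_matrix((4, 4), dtype=float)  # 4x4 with no stored values

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    M[1, 2] = 5.0  # zero -> nonzero changes the sparsity structure

flagged = [w for w in caught
           if issubclass(w.category, SparseEfficiencyWarning)]
print("warnings raised:", len(flagged))
print("stored values:", M.nnz)
```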

lil (lists of lists) is also easy to understand, and best for incremental building. dok is actually a dictionary subclass.
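
For the word-count case in the question, incremental building with lil might look like the sketch below. The documents and the vocab dictionary are invented for illustration; the pattern is to fill a lil_matrix entry by entry and convert to csr once at the end.

```python
from scipy import sparse

# Hypothetical tokenized documents
docs = [
    ["sparse", "matrix", "sparse"],
    ["numpy", "array"],
    ["matrix", "array", "array"],
]

# Assign each distinct word a column index in order of first appearance
vocab = {}
for doc in docs:
    for word in doc:
        vocab.setdefault(word, len(vocab))

# Fill the counts one entry at a time; lil is designed for this
counts = sparse.lil_matrix((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc:
        counts[i, vocab[word]] += 1

X = counts.tocsr()  # convert once, then use csr for the math
print(vocab)
print(X.toarray())
```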

A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).
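
Two places where the np.matrix-like behavior shows up are the * operator and reductions. This sketch uses csr_matrix specifically; scipy's newer sparse array classes (such as csr_array) follow ndarray conventions instead, so treat this as describing the matrix classes only.

```python
import numpy as np
from scipy import sparse

A = np.array([[1, 2], [0, 3]])
S = sparse.csr_matrix(A)

elementwise = A * A           # ndarray: * is elementwise
matrix_product = S * S        # csr_matrix: * is the matrix product
print(elementwise)
print(matrix_product.toarray())

dense_sum = A.sum(axis=0)     # ndarray: result drops to 1d
sparse_sum = S.sum(axis=0)    # csr_matrix: result stays 2d
print(dense_sum.shape, sparse_sum.shape, type(sparse_sum))
```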

A search for other questions using scikit-learn and sparse might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.

I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

In [251]: M=sparse.random(1000,1000,.5)

In [252]: timeit M1=M*M
1 loops, best of 3: 2.78 s per loop

In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1 loops, best of 3: 4.28 s per loop

It is a size issue; for a smaller matrix the dense dot is faster:

In [255]: M=sparse.random(100,100,.5)

In [256]: timeit M1=M*M
100 loops, best of 3: 3.24 ms per loop

In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1000 loops, best of 3: 1.44 ms per loop

But compare indexing:

In [268]: timeit M.tocsr()[500,500]
10 loops, best of 3: 86.4 ms per loop

In [269]: timeit Ma[500,500]
1000000 loops, best of 3: 318 ns per loop

In [270]: timeit Ma=M.toarray();Ma[500,500]
10 loops, best of 3: 23.6 ms per loop
