Using a sparse matrix versus numpy array

Question

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. However, when I fed a numpy array versus a sparse matrix into the scikit-learn logistic regression classifier, it did not seem to make much of a difference. So I was wondering about three things:

A sparse matrix is a matrix in which most of the elements are zero.

Is that an appropriate way to determine when to use a sparse matrix format, i.e. as soon as more than 50% of the values are zero? Or does it make sense to use one just in case?

Any help is greatly appreciated!

Recommended answer

The scipy sparse matrix package, and similar ones in MATLAB, are based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like the matrix product (the dot product for numpy arrays) and equation solvers are well developed.

My rough experience is that a sparse csr matrix needs to have only about 1% nonzero values for its matrix product to be faster than the equivalent dense dot operation; in other words, one nonzero value for every 99 zeros. (But see the tests below.)
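
To check where a given matrix sits relative to that rule of thumb, the fraction of stored (nonzero) entries can be computed from the nnz attribute. This is a minimal sketch; the shape, the density value, and the variable names are assumptions chosen only for illustration.

```python
from scipy import sparse

# Hypothetical test matrix: 200 x 500 with roughly 1% of entries nonzero.
M = sparse.random(200, 500, density=0.01, format="csr", random_state=0)

# nnz is the number of stored entries; divide by the total element count.
n_rows, n_cols = M.shape
density = M.nnz / (n_rows * n_cols)

print(f"{M.nnz} stored values, density = {density:.4f}")
```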

People also try to use sparse matrices to save memory. Keep in mind, though, that such a matrix has to store three arrays (the values plus their row and column indices, at least in the coo format). So the fraction of nonzero values has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array and then create the sparse one from it.
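
The memory argument can be checked directly by adding up the sizes of the three coo arrays and comparing them with the dense equivalent. The sketch below assumes a 1000 x 1000 float64 matrix at 1% density, both picked for illustration. The exact break-even point depends on the value and index dtypes, so the 1/3 figure is best read as a rule of thumb for the case where all three arrays use the same item size.

```python
import numpy as np
from scipy import sparse

# Hypothetical size and density, chosen only for illustration.
n_rows, n_cols, density = 1000, 1000, 0.01

# Build the sparse matrix directly, without creating a dense array first.
M = sparse.random(n_rows, n_cols, density=density, format="coo",
                  dtype=np.float64, random_state=0)

# coo stores three 1d arrays: the values, their row indices, their column indices.
sparse_bytes = M.data.nbytes + M.row.nbytes + M.col.nbytes

# A dense array stores every element, zero or not.
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize

print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```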

The scipy package implements many sparse formats. The coo format is the easiest to understand and build. Build one according to the documentation and look at its .data, .row, and .col attributes (three 1d arrays).
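
As a small illustration of that suggestion, the following sketch builds a coo matrix from three arrays and prints the attributes mentioned above. The 3-document by 4-word shape and the counts are made up for the example.

```python
import numpy as np
from scipy import sparse

# Three nonzero word counts in a hypothetical 3-document x 4-word matrix.
data = np.array([2, 1, 5])   # the counts themselves
row = np.array([0, 0, 2])    # document index of each count
col = np.array([1, 3, 0])    # word index of each count

M = sparse.coo_matrix((data, (row, col)), shape=(3, 4))

print(M.data)       # stored values
print(M.row)        # their row indices
print(M.col)        # their column indices
print(M.toarray())  # dense view, for checking only
```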

csr and csc are typically built from the coo format, and compress the data a bit, which makes them a bit harder to understand. But they have most of the math functionality.
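
To make the compression a little more concrete, here is a sketch that converts a small coo matrix (made-up values) to csr and prints the resulting arrays. In csr the per-element row indices are replaced by indptr, a row pointer array with one entry per row plus one.

```python
from scipy import sparse

# Same kind of toy matrix as a coo example: values, (rows, cols), shape.
coo = sparse.coo_matrix(([2, 1, 5], ([0, 0, 2], [1, 3, 0])), shape=(3, 4))
csr = coo.tocsr()

# data: stored values, ordered row by row
# indices: column index of each stored value
# indptr: row i occupies data[indptr[i]:indptr[i + 1]]
print(csr.data)
print(csr.indices)
print(csr.indptr)
```

Row 1 is empty here, so its slice of data has zero length.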

It is also possible to index the csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations such as changing values (especially from 0 to nonzero), concatenation, and incremental growth are also slower.

lil (lists of lists) is also easy to understand, and best for incremental building. dok is actually a dictionary subclass.
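
For the word count use case in the question, incremental building with lil might look like the sketch below. The documents and the vocabulary mapping are hypothetical, and converting to csr at the end is one reasonable choice before doing math on the result, not the only one.

```python
from scipy import sparse

# Hypothetical tokenized documents.
docs = [["sparse", "matrix", "sparse"], ["numpy", "array"], ["sparse", "array"]]

# Assign each distinct word a column index in order of first appearance.
vocab = {}
for doc in docs:
    for word in doc:
        vocab.setdefault(word, len(vocab))

# Fill the counts one element at a time; lil handles this kind of growth well.
counts = sparse.lil_matrix((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc:
        counts[i, vocab[word]] += 1

# Convert once, after building, to a format with better math support.
X = counts.tocsr()
print(X.toarray())
```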

A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).
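
Two places where this np.matrix-like behavior shows up are the * operator and reductions such as sum. The sketch below uses a small made-up array and the csr_matrix class; it compares the same operations on a plain numpy array.

```python
import numpy as np
from scipy import sparse

A = np.array([[1, 2], [0, 3]])
S = sparse.csr_matrix(A)

dense_elementwise = A * A          # ndarray: * is elementwise
dense_matmul = A @ A               # ndarray: @ is the matrix product
sparse_star = (S * S).toarray()    # csr_matrix: * is the matrix product

row_sums_dense = A.sum(axis=1)     # 1d result, shape (2,)
row_sums_sparse = S.sum(axis=1)    # stays 2d, shape (2, 1)

print(dense_elementwise)
print(sparse_star)
print(row_sums_dense.shape, row_sums_sparse.shape)
```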

A search for other questions that use scikit-learn and sparse might be the best way of finding the pros and cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.

I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

In [251]: M=sparse.random(1000,1000,.5)

In [252]: timeit M1=M*M
1 loops, best of 3: 2.78 s per loop

In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1 loops, best of 3: 4.28 s per loop

It is a size issue; for smaller matrices the dense dot is faster:

In [255]: M=sparse.random(100,100,.5)

In [256]: timeit M1=M*M
100 loops, best of 3: 3.24 ms per loop

In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1000 loops, best of 3: 1.44 ms per loop

But compare indexing:

In [268]: timeit M.tocsr()[500,500]
10 loops, best of 3: 86.4 ms per loop

In [269]: timeit Ma[500,500]
1000000 loops, best of 3: 318 ns per loop

In [270]: timeit Ma=M.toarray();Ma[500,500]
10 loops, best of 3: 23.6 ms per loop
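
The timings above come from an interactive IPython session. A standalone script along the following lines could be used to repeat the comparison on another machine. The matrix size, the density, and the repeat counts are assumptions, and the numbers will vary with hardware and library versions, so no expected timings are given. Unlike the tocsr() and toarray() timings above, the conversions here are done once, outside the timed calls.

```python
import timeit

import numpy as np
from scipy import sparse

# Hypothetical size and density; adjust to match the data of interest.
M = sparse.random(300, 300, density=0.01, format="csr", random_state=0)
Ma = M.toarray()

# Matrix product: sparse versus dense.
t_sparse = timeit.timeit(lambda: M @ M, number=20)
t_dense = timeit.timeit(lambda: Ma @ Ma, number=20)

# Single-element indexing: sparse versus dense.
t_sparse_idx = timeit.timeit(lambda: M[150, 150], number=200)
t_dense_idx = timeit.timeit(lambda: Ma[150, 150], number=200)

# Sanity check that both products agree.
same = np.allclose((M @ M).toarray(), Ma @ Ma)

print(f"product:  sparse {t_sparse:.4f} s, dense {t_dense:.4f} s")
print(f"indexing: sparse {t_sparse_idx:.4f} s, dense {t_dense_idx:.4f} s")
print("results agree:", same)
```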
