Scipy sparse... arrays?

Question

So, I'm doing some k-means classification using numpy arrays that are quite sparse: lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.

I've gone through this tutorial on how to create sparse matrices: http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.

Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.

I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.

Any advice?

Answer

Use a scipy.sparse format that is row- or column-based: csr_matrix or csc_matrix.

These use efficient C implementations under the hood (including multiplication), and transposition is a no-op (especially if you call transpose(copy=False)), just like with numpy arrays.
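As a minimal sketch of that suggestion, two 1xN csr_matrix rows can stand in for sparse vectors, with the dot product read from the single entry of a * b.T. The names a and b and the example values are illustrative assumptions, not part of the original post.

```python
import numpy as np
import scipy.sparse as sp

# Two mostly-zero dense vectors (illustrative values, not from the post).
a_dense = np.array([0.0, 2.0, 0.0, 0.0, 3.0])
b_dense = np.array([1.0, 4.0, 0.0, 0.0, 5.0])

# Each becomes a 1xN sparse row stored in CSR format.
a = sp.csr_matrix(a_dense)
b = sp.csr_matrix(b_dense)

# (1xN) * (Nx1) gives a 1x1 sparse matrix; extract its only entry.
dot_sparse = (a * b.T)[0, 0]

# Cross-check against the dense computation.
dot_dense = np.dot(a_dense, b_dense)
print(dot_sparse, dot_dense)
```

The transpose is written inline here because, for the matrix classes, `*` means matrix multiplication and needs the shapes to line up.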

EDIT: some timings via ipython:

import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))

Now x_csr and x_dok are 50% sparse:

print(repr(x_csr))
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 49757 stored elements in Compressed Sparse Row format>

And the timings:

timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop

timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop

timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop

timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop

So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(
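To see what that means in practice, the sketch below transposes a small CSR row and inspects the result. The variable names and values are assumptions for illustration; the point is only that the transpose of a CSR matrix comes back in CSC format, and that the product is a separate sparse matrix object.

```python
import numpy as np
import scipy.sparse as sp

# A small 1x4 sparse row (illustrative values).
row = sp.csr_matrix(np.array([0.0, 1.0, 0.0, 2.0]))

# Transposing a CSR matrix yields a CSC matrix of shape 4x1.
col = row.transpose(copy=False)
print(row.format, row.shape)
print(col.format, col.shape)

# Multiplying the two builds a new 1x1 sparse matrix holding the dot product.
product = row * col
print(product.shape, product[0, 0])
```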

As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:

timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop

Note that this last approach is again a dense numpy multiplication, only over the stored (nonzero) values. Since half of the entries are zero, it is about twice as fast as dot(x, x).
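One caveat about the .data shortcut: it lines up stored values by position in the data array, so it gives the right answer here only because the vector is multiplied with itself. For two different sparse vectors the stored entries generally sit at different indices. A hedged sketch of a more general version, using the multiply(...).sum() form from the timings (the helper name sparse_dot and the example values are hypothetical):

```python
import numpy as np
import scipy.sparse as sp

# Two 1xN sparse rows with different sparsity patterns (illustrative values).
a = sp.csr_matrix(np.array([0.0, 2.0, 0.0, 0.0, 3.0]))
b = sp.csr_matrix(np.array([1.0, 4.0, 0.0, 0.0, 0.0]))

def sparse_dot(u, v):
    """Dot product of two 1xN sparse rows via element-wise multiplication."""
    return u.multiply(v).sum()

# Same vector on both sides: the stored values align, so the shortcut agrees.
self_dot_shortcut = np.dot(a.data, a.data)
self_dot_general = sparse_dot(a, a)
print(self_dot_shortcut, self_dot_general)

# Different vectors: only the shared nonzero position (index 1) contributes.
cross_dot = sparse_dot(a, b)
print(cross_dot)
```

This is slower than the raw .data trick, as the timings above suggest, but it does not depend on the two vectors sharing a sparsity pattern.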
