Scipy sparse... arrays?
Problem description
So, I'm doing some Kmeans classification using numpy arrays that are quite sparse -- lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.

I've gone through this tutorial on how to create sparse matrices:
http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.

Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.

I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do. Any advice?

Answer

Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix. These use efficient, C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)), just like with numpy arrays.

EDIT: some timings via ipython:

import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float)  # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))

Now x_csr and x_dok are 50% sparse:

print repr(x_csr)
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 49757 stored elements in Compressed Sparse Row format>

And the timings:

timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop

timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop

timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop

timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop

So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(

As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:

timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop

Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2.
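The answer's timings were taken on Python 2 with scipy 0.9.0. A minimal Python 3 sketch of the same setup (the seeded generator is my addition, for reproducibility), checking that the sparse variants all agree with the dense dot product:

```python
import numpy
import scipy.sparse

n = 100000
rng = numpy.random.default_rng(0)  # seeded generator -- my addition, not in the original
x = (rng.random(n) * 2).astype(int).astype(float)  # ~50% sparse vector of 0.0/1.0

x_csr = scipy.sparse.csr_matrix(x)  # 1 x n CSR row vector

# The variants timed in the answer, checked for agreement:
d_dense = numpy.dot(x, x)                    # dense baseline
d_mul = x_csr.multiply(x_csr).sum()          # elementwise multiply, then sum
d_mat = (x_csr * x_csr.T)[0, 0]              # (1 x n) times (n x 1) -> 1 x 1 matrix
d_hack = numpy.dot(x_csr.data, x_csr.data)   # direct dot on the stored values

assert d_dense == d_mul == d_mat == d_hack
```

All four are exact here because the entries are 0.0/1.0 and the sums stay within exact float64 integer range.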
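One caveat the answer doesn't state: the numpy.dot(x_csr.data, x_csr.data) hack only matches the true dot product because both operands are the same vector, so their nonzero patterns coincide. For two different sparse vectors the .data arrays are generally misaligned. A small sketch (the example vectors are mine, chosen for illustration), with pattern-safe alternatives:

```python
import numpy
import scipy.sparse

x = scipy.sparse.csr_matrix(numpy.array([1.0, 0.0, 2.0, 0.0]))
y = scipy.sparse.csr_matrix(numpy.array([0.0, 3.0, 4.0, 0.0]))

true_dot = 2.0 * 4.0  # only index 2 is nonzero in both vectors -> 8.0

# The .data hack pairs x's stored values [1, 2] with y's [3, 4],
# ignoring where they live: 1*3 + 2*4 = 11, the wrong answer.
assert numpy.dot(x.data, y.data) == 11.0

# Pattern-safe alternatives:
assert x.multiply(y).sum() == true_dot   # elementwise multiply, then sum
assert (x * y.T)[0, 0] == true_dot       # (1 x n) times (n x 1) matrix product
```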