scipy sparse matrix sum results in a dense matrix - how to enforce result sparseness?


Question

Summing over one axis of a scipy.sparse.csr_matrix results in a numpy.matrix object. Given that my sparse matrix is really sparse, I find this behaviour extremely annoying.

Here is an example:

dense = [[ 0.,  0.,  0.,  0.,  0.],
         [ 1.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.],
         [ 2.,  0.,  4.,  0.,  0.]]


from scipy.sparse import csr_matrix
sparse = csr_matrix(dense)

print(sparse.sum(1))

Result:

matrix([[ 0.],
        [ 1.],
        [ 0.],
        [ 0.],
        [ 6.]])

How can I enforce sparseness in the sum-over-columns operation without implicitly converting the matrix to dense format? In this example I've just used a small matrix, but my real matrix is far larger, and sparser, so it's a large waste of space to pass through a dense representation.

Answer

sparse performs the sum with a matrix multiplication (M below is the csr_matrix built in the question):

In [136]: np.matrix(np.ones(M.shape[1]))@M                                      
Out[136]: matrix([[3., 0., 4., 0., 0.]])
In [137]: M@np.matrix(np.ones((M.shape[1],1)))                                  
Out[137]: 
matrix([[0.],
        [1.],
        [0.],
        [0.],
        [6.]])
In [138]: timeit M@np.matrix(np.ones((M.shape[1],1)))                           
91.5 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [139]: timeit M.sum(1)                                                       
96.6 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The times are similar. Both produce an np.matrix result.

If I multiply by a 2d array instead, I get an array result and, somewhat surprisingly, a much better time:

In [140]: timeit M@np.ones((M.shape[1],1))                                      
24.4 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [141]: M@np.ones((M.shape[1],1))                                             
Out[141]: 
array([[0.],
       [1.],
       [0.],
       [0.],
       [6.]])

I could put that array back into a sparse matrix - but at a time cost:

In [142]: csr_matrix(M@np.ones((M.shape[1],1)))                                 
Out[142]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [143]: timeit csr_matrix(M@np.ones((M.shape[1],1)))                          
391 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Or we could create a sparse matrix first:

In [144]: M@csr_matrix(np.ones((M.shape[1],1)))                                 
Out[144]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [145]: timeit M@csr_matrix(np.ones((M.shape[1],1)))                          
585 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even with the creation of the extractor matrix removed from the timed loop, it is still slower:

In [146]: %%timeit m1 = csr_matrix(np.ones((M.shape[1],1))) 
     ...: M@m1                                                                     
227 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A sum like this (nearly) always increases the density of the result. Matrices with at least one nonzero value per row are more common than ones with many all-zero rows. Timings in your real-world case might be different, but trying to save memory might not buy you that much.

If I look in more detail at the csr matrix produced by the sparse matrix multiplication:

In [147]: res = M@csr_matrix(np.ones((M.shape[1],1)))                           
In [148]: res                                                                   
Out[148]: 
<5x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
In [149]: res.indptr                                                            
Out[149]: array([0, 0, 1, 1, 1, 2], dtype=int32)
In [150]: res.indices                                                           
Out[150]: array([0, 0], dtype=int32)

The indptr array has one value per row (+1), so the memory use of this column matrix is actually higher than that of the dense equivalent. The same res in csc format would be more compact, with a 2-element indptr.
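A quick check of that claim, assuming the same M as in the question: converting res to csc shrinks indptr from one entry per row (+1) to one entry per column (+1).

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix([[0., 0., 0., 0., 0.],
                [1., 0., 0., 0., 0.],
                [0., 0., 0., 0., 0.],
                [0., 0., 0., 0., 0.],
                [2., 0., 4., 0., 0.]])

res = M @ csr_matrix(np.ones((M.shape[1], 1)))  # 5x1 csr result

print(res.indptr)           # 6 entries: one per row, plus one
print(res.tocsc().indptr)   # 2 entries: one per column, plus one
```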

It is also possible to work directly with the indptr, indices, and data attributes of the csr matrix, essentially iterating over the rows and summing each, and creating a new sparse matrix from that. In some cases this achieves speed improvements over the sparse methods. But you have to understand that data storage, and be smart about the whole thing.
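One way to sketch that idea (my own illustration, not code from the answer): use indptr to recover the row index of each stored value, accumulate the row sums with np.bincount, and build a sparse column from the nonzero rows only.

```python
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

M = csr_matrix([[0., 0., 0., 0., 0.],
                [1., 0., 0., 0., 0.],
                [0., 0., 0., 0., 0.],
                [0., 0., 0., 0., 0.],
                [2., 0., 4., 0., 0.]])

# np.diff(indptr) gives the number of stored values in each row;
# repeating each row index that many times labels every entry of M.data
rows = np.repeat(np.arange(M.shape[0]), np.diff(M.indptr))
row_sums = np.bincount(rows, weights=M.data, minlength=M.shape[0])

# Keep only the rows whose sum is nonzero and pack them into a 5x1 sparse column
nz = np.flatnonzero(row_sums)
res = coo_matrix((row_sums[nz], (nz, np.zeros(len(nz), dtype=int))),
                 shape=(M.shape[0], 1)).tocsc()
print(res.toarray().ravel())  # [0. 1. 0. 0. 6.]
```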
