Python的稀疏LIL矩阵中极慢的求和行运算 [英] Extremely slow sum row operation in Sparse LIL matrix in Python

查看:166
本文介绍了Python的稀疏LIL矩阵中极慢的求和行运算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经用Python编写了这段代码,该代码可以达到预期的效果,但是速度却非常慢.瓶颈是scipy.sparse.lil_matrix的多行求和.我该如何快速?

I have written this code in Python that is giving expected results but is extremely extremely slow. The bottleneck is the summing multiple rows of the scipy.sparse.lil_matrix. How can I make it fast?

# D1 is a 1.5M x 1.3M sparse matrix, read as scipy.sparse.lil_matrix.
# D2 is a 1.5M x 111 matrix, read as numpy.array
# F1 is a csv file, read using csv.reader

for row in F1:
    user_id = row[0]
    clust = D2[user_id, 110]
    neighbors = D2[ D2[:, 110] == clust][:,1]
    score = np.zeros(1300000)

    for neigh in neighbors:
        score = score + D1 [neigh, :] # the most expensive operation

    toBeWritten = np.argsort(score)[:,::-1].A[0,:]

请让我知道是否还有其他不是很理想的东西.

Please let me know if there is something else too that is not very optimal.

推荐答案

首先是一个具有非常小的矩阵的演示

First a demo with a very small matrix

In [523]: idx=np.arange(0,8,2)
In [526]: D=np.arange(24).reshape(8,3)
In [527]: Dll=sparse.lil_matrix(D)

In [528]: D[idx,:].sum(axis=0)
Out[528]: array([36, 40, 44])

In [529]: Dll[idx,:].sum(axis=0)
Out[529]: matrix([[36, 40, 44]], dtype=int32)

In [530]: timeit D[idx,:].sum(axis=0)
100000 loops, best of 3: 17.3 µs per loop

In [531]: timeit Dll[idx,:].sum(axis=0)
1000 loops, best of 3: 1.16 ms per loop

In [532]: score=np.zeros(3)      # your looping version
In [533]: for i in idx:
   .....:     score = score + Dll[i,:]

In [534]: score
Out[534]: matrix([[ 36.,  40.,  44.]])

In [535]: %%timeit
   .....: score=np.zeros(3)
   .....: for i in idx:
    score = score + Dll[i,:]
   .....: 
100 loops, best of 3: 2.76 ms per loop

对于某些操作,csr格式更快:

For some operations the csr format is faster:

In [537]: timeit Dll.tocsr()[idx,:].sum(axis=0)
1000 loops, best of 3: 955 µs per loop

或者如果我预先转换为csr:

or if I preconvert to csr:

In [538]: Dcsr=Dll.tocsr()

In [539]: timeit Dcsr[idx,:].sum(axis=0)
1000 loops, best of 3: 724 µs per loop

相对于密集,还是慢的.

Still slow relative to dense.

我将要讨论使用稀疏矩阵的数据属性作为更快选择行的方法.但是,如果选择这些行的唯一目的是求和它们的值,那么我们就不需要这样做.

I was going to talk about working with the data attributes of the sparse matrix as a way of selecting rows faster. But if the only purpose for selecting these rows is to sum their values we don't need to do that.

稀疏矩阵对行或列求和,方法是对列或行矩阵为1的矩阵乘积进行求和.我只是用相同的答案回答了另一个问题.

Sparse matrices sum on rows or columns by doing a matrix product with a column or row matrix of ones. And I just answered another question with the same answer.

https://stackoverflow.com/a/37120235/901925 Efficiently compute columnwise sum of sparse array where every non-zero element is 1

例如:

In [588]: I=np.asmatrix(np.zeros((1,Dll.shape[0])))    
In [589]: I[:,idx]=1
In [590]: I
Out[590]: matrix([[ 1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.]])
In [591]: I*Dll
Out[591]: matrix([[ 36.,  40.,  44.]])

In [592]: %%timeit 
I=np.asmatrix(np.zeros((1,Dll.shape[0])))
I[:,idx]=1
I*Dll
   .....: 
1000 loops, best of 3: 919 µs per loop

对于这个小的矩阵,它对速度没有帮助,但是随着Dcsr时间下降到215 µs(对于数学而言,这要好得多).如果使用大型矩阵,此产品版本将有所改进.

For this small matrix it did not help the speed, but with the Dcsr time drops to 215 µs (it's much better for math). With large matrices this product version will improve.

=================

=================

在另一个问题中,我刚刚发现A_csr[[1,1,0,3],:]行选择实际上是通过矩阵乘积完成的.它构造一个看起来像

I just found out, in another question, that a A_csr[[1,1,0,3],:] row selection is actually done with a matrix product. It constructs an 'extractor' csr matrix that looks like

matrix([[0, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

https://stackoverflow.com/a/372​​45105/901925

这篇关于Python的稀疏LIL矩阵中极慢的求和行运算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆