Scipy sparse matrix alternative for getrow()
Question
I am working with large sparse binary matrices. I have condensed them using the Scipy sparse matrix implementation. The calculation of Jaccard distance from scipy.spatial.distance does not support direct operation on sparse matrices, so either:
- Convert the entire sparse matrix to dense and operate on each row as a vector, which is memory hungry. Or
- Loop through the sparse matrix, grab each row using getrow() and operate. Or
- Write our own implementation to work on sparse matrices.
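For reference, the Jaccard distance between two binary vectors is 1 minus the ratio of shared nonzero positions to total nonzero positions. A minimal pure-Python sketch of the definition (illustrative only, not one of the implementations benchmarked below):

```python
def jaccard_distance(u, v):
    """Jaccard distance between two equal-length binary (0/1) sequences."""
    common = sum(1 for a, b in zip(u, v) if a and b)  # |intersection| of nonzero positions
    union = sum(1 for a, b in zip(u, v) if a or b)    # |union| of nonzero positions
    return 1.0 - common / union if union else 0.0

print(jaccard_distance([1, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 0]))  # 0.75
```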
Here is the sample code:
import scipy.spatial.distance as d
import numpy as np
from scipy.sparse import csr_matrix
# benchmark performance
X = np.random.random((3000, 3000))
# binarize
X[X > 0.3] = 0
X[X>0] = 1
mat = csr_matrix(X)
a = np.zeros(3000)
a[4] = a[100] = a[22] =1
a = csr_matrix(a)
def jaccard_fast(v1,v2):
    common = v1.dot(v2.T)
    dis = (v1 != v2).getnnz()
    if common[0,0]:
        return 1.0-float(common[0,0])/float(common[0,0]+dis)
    else:
        return 0.0

def benchmark_jaccard_fast():
    for i in range(mat.shape[0]):
        jaccard_fast(mat.getrow(i),a)

def benchmark_jaccard_internal_todense():
    for v1,v2 in zip(mat.todense(),a.todense()):
        d.jaccard(v1,v2)

def benchmark_jaccard_internal_getrow():
    for i in range(mat.shape[0]):
        d.jaccard(mat.getrow(i),a)
print "Jaccard Fast:"
%time benchmark_jaccard_fast()
print "Jaccard Scipy (expanding to dense):"
%time benchmark_jaccard_internal_todense()
print "Jaccard Scipy (using getrow):"
%time benchmark_jaccard_internal_getrow()
where jaccard_fast is my own implementation. It appears that my implementation is faster than the internal one on scipy sparse matrices; however, getrow() seems to slow my implementation down. As I benchmark jaccard_fast against scipy.spatial.distance.jaccard, the results are:
Jaccard Fast:
CPU times: user 1.28 s, sys: 0 ns, total: 1.28 s
Wall time: 1.28 s
Jaccard Scipy (expanding to dense):
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 37.2 ms
Jaccard Scipy (using getrow):
CPU times: user 1.82 s, sys: 0 ns, total: 1.82 s
Wall time: 1.81 s
Any help on how to avoid the getrow bottleneck would be appreciated. I cannot afford to expand my sparse matrix using todense() due to memory limitations.
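To put that memory constraint in numbers: a dense float64 matrix costs 8 bytes per entry regardless of sparsity, while CSR stores only the nonzeros (data plus indices plus indptr). A rough sketch, where the 100 000 × 100 000 shape and 0.1% density are made-up illustrations, not figures from the question:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical shape and density, purely for illustration.
n, density = 100_000, 0.001
dense_bytes = n * n * 8                      # float64 dense: ~80 GB
nnz = int(n * n * density)
# CSR with float64 data and int32 indices/indptr:
csr_bytes = nnz * (8 + 4) + (n + 1) * 4      # ~120 MB
print(dense_bytes / 1e9, "GB dense vs", csr_bytes / 1e6, "MB CSR")

# Sanity check on a small matrix: data + indices + indptr is all CSR stores.
m = csr_matrix(np.eye(1000))
actual = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
```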
Answer
Sparse indexing is known for being slower, e.g. How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
In [33]: timeit for row in mat: x=row # sparse iteration
1 loops, best of 3: 510 ms per loop
In [35]: timeit for row in mat.todense(): x=row # dense iteration
10 loops, best of 3: 175 ms per loop
but I find that your d.jaccard is also slower when using sparse matrices:
In [36]: ad=a.todense()
In [37]: timeit for row in mat.todense(): d.jaccard(row,ad) # all dense
1 loops, best of 3: 734 ms per loop
In [38]: timeit for row in mat: d.jaccard(row.todense(),ad) # inner dense
1 loops, best of 3: 1.69 s per loop
In [39]: timeit for row in mat: d.jaccard(row,a) # all sparse
1 loops, best of 3: 4.61 s per loop
Removing the getrow factor:
In [40]: mrow=mat.getrow(0)
In [41]: mrowd=mrow.todense()
In [42]: timeit d.jaccard(mrow, a) # one sparse row
1000 loops, best of 3: 1.32 ms per loop
In [43]: timeit d.jaccard(mrow.todense(), a.todense()) # with conversion
1000 loops, best of 3: 539 µs per loop
In [44]: timeit d.jaccard(mrowd, ad) # dense
10000 loops, best of 3: 173 µs per loop
=====================
I need to rerun those tests because d.jaccard does not work with sparse (and jaccard_fast does not work with dense). So separating the sparse row iteration issues from the jaccard calculation will take more work.
I made some modifications to jaccard_fast:
def my_jaccard(mat, a):
    common = mat*a.T  # sparse does the large matrix product well
    dis = np.array([(row!=a).getnnz() for row in mat])  # iterative
    cA = common.A.ravel()
    return 1 - cA/(cA + dis)
It returns values that match d.jaccard run on the dense rows. d.jaccard returns 1 for the rows where common is 0. I don't seem to need a cA test (unless it is possible that both cA and dis are 0 at the same slot).
In [141]: r=np.array([d.jaccard(row,ad) for row in mat.todense()])
In [142]: r1=my_jaccard(mat,a)
In [143]: np.allclose(r,r1)
Out[143]: True
and the speed is only half as fast. If I can rework the dis calc it should have similar speed.
In [144]: timeit r=np.array([d.jaccard(row,ad) for row in mat.todense()])
1 loops, best of 3: 783 ms per loop
In [145]: timeit r1=my_jaccard(mat,a)
1 loops, best of 3: 1.29 s per loop
A further tweak to the calculation: I mask out the common values that are 0. This has two purposes - it ensures we don't have a divide-by-zero problem, and it reduces the number of dis iterations, giving a small speed improvement.
def my_jaccard(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    dis = np.array([(row!=a).getnnz() for row, b in zip(mat,mask) if b])
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis)
    return ret
With this the time drops a bit:
In [188]: timeit my_jaccard(mat,a)
1 loops, best of 3: 1.04 s per loop
=================
In that question I looked at comparing a sparse matrix with a 1-row matrix, and found that using sparse.kron to replicate the row was the fastest way to replicate numpy broadcasting.
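The replication trick can be sketched in isolation: sparse.kron of a 1-row matrix with a column of ones tiles that row, mimicking what numpy broadcasting (or np.tile) would do on dense arrays. The shapes below are made up for illustration:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0, 1, 0, 1, 0]]))   # one sparse row
n_rows = 4
# kron with an (n_rows, 1) column of ones stacks n_rows copies of `a`
aM = sparse.kron(a, np.ones((n_rows, 1), int)).tocsr()
print(aM.shape)  # (4, 5): the row replicated 4 times
print((aM.toarray() == np.tile(a.toarray(), (n_rows, 1))).all())
```

This keeps the replicated matrix sparse, so an elementwise comparison like `mat != aM` never has to materialize a dense array.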
Using that idea in jaccard to calculate the dis array:
def my_jaccard1(mat, a):
    # requires: from scipy import sparse
    common = mat*a.T
    cA = common.A.ravel()
    aM = sparse.kron(a,np.ones((mat.shape[0],1),int))
    dis = (mat!=aM).sum(1)
    ret = 1-cA/(cA+dis.A1)
    return ret
With this, timings improve significantly (10x):
In [318]: timeit my_jaccard1(mat,a)
1 loops, best of 3: 97.1 ms per loop
I can apply masking as before to protect against divide by zero, but it actually slows down the calculation (to 140 ms):
def my_jaccard3(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    aM = sparse.kron(a,np.ones((len(cA),1),int))
    dis = (mat[mask,:]!=aM).sum(1)
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis.A1)
    return ret
======================
edit - test of the suspected case:
In [75]: x,y= np.array([1,1,0,0,1,0]), np.array([0,0,1,0,1,0])
In [76]: d.jaccard(x,y)
Out[76]: 0.75
In [78]: jaccard_fast(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[78]: 0.75
My versions:
In [79]: my_jaccard(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[79]: array([ 0.75])
...
In [82]: my_jaccard3(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[82]: array([ 0.75])
(edit - to use sparse.kron explicitly)