Scipy sparse matrix alternative for getrow()


Question

I am working with large sparse binary matrices, which I have compressed using Scipy's sparse matrix implementation. The calculation of Jaccard distance from scipy.spatial.distance does not support direct operation on sparse matrices, so I can either:

  1. Convert the entire sparse matrix to dense and operate on each row as a vector, which is memory hungry.

  2. Loop through the sparse matrix, grab each row using getrow() and operate on it.

  3. Write our own implementation to work on sparse matrices.

Here is the sample code:

import scipy.spatial.distance as d
import numpy as np
from scipy.sparse import csr_matrix

# benchmark performance
X = np.random.random((3000, 3000))
# binarize
X[X > 0.3] = 0
X[X > 0] = 1
mat = csr_matrix(X)

a = np.zeros(3000)
a[4] = a[100] = a[22] = 1
a = csr_matrix(a)

def jaccard_fast(v1,v2):
    common = v1.dot(v2.T)
    dis = (v1 != v2).getnnz()
    if common[0,0]:
        return 1.0-float(common[0,0])/float(common[0,0]+dis)
    else:
        return 0.0
    
def benchmark_jaccard_fast():
    for i in range(mat.shape[0]):
        jaccard_fast(mat.getrow(i),a)
        
def benchmark_jaccard_internal_todense():
    for v1,v2 in zip(mat.todense(),a.todense()):
         d.jaccard(v1,v2)
        
def benchmark_jaccard_internal_getrow():
    for i in range(mat.shape[0]):
        d.jaccard(mat.getrow(i),a)
        

print "Jaccard Fast:"
%time benchmark_jaccard_fast()
print "Jaccard Scipy (expanding to dense):"
%time benchmark_jaccard_internal_todense()
print "Jaccard Scipy (using getrow):"
%time benchmark_jaccard_internal_getrow()

where jaccard_fast is my own implementation. It appears that my implementation is faster than the internal one on scipy sparse matrices; however, getrow() seems to slow it down. Benchmarking jaccard_fast against scipy.spatial.distance.jaccard, the results are:

Jaccard Fast:
CPU times: user 1.28 s, sys: 0 ns, total: 1.28 s
Wall time: 1.28 s
Jaccard Scipy (expanding to dense):
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 37.2 ms
Jaccard Scipy (using getrow):
CPU times: user 1.82 s, sys: 0 ns, total: 1.82 s
Wall time: 1.81 s

Any help on how to avoid the getrow bottleneck would be appreciated. I cannot afford to expand my sparse matrix using todense() due to memory limitations.

Answer

Sparse indexing is known to be slower; see e.g. How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

In [33]: timeit for row in mat: x=row  # sparse iteration
1 loops, best of 3: 510 ms per loop

In [35]: timeit for row in mat.todense(): x=row  # dense iteration
10 loops, best of 3: 175 ms per loop

but I find that your d.jaccard is also slower when using sparse matrices:

In [36]: ad=a.todense()

In [37]: timeit for row in mat.todense(): d.jaccard(row,ad) # all dense
1 loops, best of 3: 734 ms per loop

In [38]: timeit for row in mat: d.jaccard(row.todense(),ad) # inner dense
1 loops, best of 3: 1.69 s per loop

In [39]: timeit for row in mat: d.jaccard(row,a) # all sparse
1 loops, best of 3: 4.61 s per loop

Eliminating the getrow factor:

In [40]: mrow=mat.getrow(0)

In [41]: mrowd=mrow.todense()

In [42]: timeit d.jaccard(mrow, a)  # one sparse row
1000 loops, best of 3: 1.32 ms per loop

In [43]: timeit d.jaccard(mrow.todense(), a.todense())  # with conversion
1000 loops, best of 3: 539 µs per loop

In [44]: timeit d.jaccard(mrowd, ad)  #  dense
10000 loops, best of 3: 173 µs per loop
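One further way to sidestep the per-row getrow() overhead (a sketch of my own, not part of the original answer; the function name jaccard_csr_rows and the set-based approach are illustrative assumptions) is to slice the CSR indptr/indices arrays directly, since for a binary matrix a row is fully described by its nonzero column indices:

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_csr_rows(mat, a):
    # Jaccard distance of every row of CSR `mat` against the 1-row CSR `a`,
    # slicing indptr/indices directly instead of calling getrow() per row
    a_cols = set(a.indices)
    indptr, indices = mat.indptr, mat.indices
    out = np.zeros(mat.shape[0])
    for i in range(mat.shape[0]):
        row_cols = set(indices[indptr[i]:indptr[i + 1]])
        union = len(row_cols | a_cols)
        # define the distance of two all-zero rows as 0, like jaccard_fast
        out[i] = 1.0 - len(row_cols & a_cols) / float(union) if union else 0.0
    return out
```

This avoids constructing a new sparse matrix object per row, which is where most of the getrow() cost goes.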

=====================

I need to rerun those tests because d.jaccard does not work with sparse matrices (and jaccard_fast does not work with dense ones). So separating the sparse-row-iteration issue from the jaccard calculation will take more work.

I reworked jaccard_fast a bit:

def my_jaccard(mat, a):
    common = mat*a.T # sparse does the large matrix product well 
    dis=np.array([(row!=a).getnnz() for row in mat]) # iterative
    cA = common.A.ravel()
    return 1 - cA/(cA + dis)

It returns values that match d.jaccard run on the dense rows. d.jaccard returns 1 for the rows where common is 0. I don't seem to need a test on cA (unless it is possible for both cA and dis to be 0 at the same slot).

In [141]: r=np.array([d.jaccard(row,ad) for row in mat.todense()])

In [142]: r1=my_jaccard(mat,a)

In [143]: np.allclose(r,r1)
Out[143]: True

but the speed is only half as fast. If I can rework the dis calc it should have similar speed.

In [144]: timeit r=np.array([d.jaccard(row,ad) for row in mat.todense()])
1 loops, best of 3: 783 ms per loop

In [145]: timeit r1=my_jaccard(mat,a)
1 loops, best of 3: 1.29 s per loop

A further tweak to the calc: I mask out the common values that are 0. This has two purposes - it ensures we don't have a divide-by-zero problem, and it reduces the number of dis iterations, giving a small speed improvement.

def my_jaccard(mat, a):
    common=mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    dis = np.array([(row!=a).getnnz() for row, b in zip(mat,mask) if b])
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis)
    return ret

With this the time drops a bit:

In [188]: timeit my_jaccard(mat,a)
1 loops, best of 3: 1.04 s per loop

=================

> Python - Efficient function with sparse matrices

In that question I looked at comparing a sparse matrix with a 1-row matrix, and found that using sparse.kron to replicate the row was the fastest way to replicate numpy broadcasting.

Using that idea in jaccard to calculate the dis array:

from scipy import sparse

def my_jaccard1(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    aM = sparse.kron(a, np.ones((mat.shape[0], 1), int))
    dis = (mat != aM).sum(1)
    ret = 1 - cA/(cA + dis.A1)
    return ret

With this, the timings improve significantly (about 10x):

In [318]: timeit my_jaccard1(mat,a)
1 loops, best of 3: 97.1 ms per loop
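As a standalone illustration of the sparse.kron trick (the small shapes here are my own example, not from the answer): taking the Kronecker product of a 1-row matrix with an n x 1 column of ones stacks n identical copies of the row, which emulates numpy broadcasting for the sparse row-vs-matrix comparison above.

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0, 1, 0, 1]]))
# kron with a 3 x 1 column of ones stacks 3 copies of the 1-row matrix
aM = sparse.kron(a, np.ones((3, 1), int))
# aM is now a 3 x 4 sparse matrix, every row equal to a
```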

I can apply masking as before to protect against divide-by-zero, but it actually slows down the calculation (to 140 ms).

def my_jaccard3(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    aM = sparse.kron(a,np.ones((len(cA),1),int))
    dis = (mat[mask,:]!=aM).sum(1)
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis.A1) 
    return ret  

======================

edit - test of the suspected case:

In [75]: x,y= np.array([1,1,0,0,1,0]), np.array([0,0,1,0,1,0])

In [76]: d.jaccard(x,y)
Out[76]: 0.75

In [78]: jaccard_fast(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[78]: 0.75

My versions:

In [79]: my_jaccard(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[79]: array([ 0.75])
...
In [82]: my_jaccard3(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[82]: array([ 0.75])
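On the other slot worth checking (cA and dis both 0): that can only happen when both rows are entirely zero. A quick check with the question's jaccard_fast (reproduced below) shows it returns 0.0 there, whereas the masked my_jaccard variants leave the preset value 1 in place for such a row - which of the two is "right" is a matter of convention:

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_fast(v1, v2):
    # same logic as the question's jaccard_fast
    common = v1.dot(v2.T)
    dis = (v1 != v2).getnnz()
    if common[0, 0]:
        return 1.0 - float(common[0, 0]) / float(common[0, 0] + dis)
    return 0.0

z = csr_matrix(np.zeros((1, 6)))
print(jaccard_fast(z, z))  # -> 0.0 (common and dis are both 0 for all-zero rows)
```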

(edit - explicitly using sparse.kron)
