有效地计算稀疏数组的纵列和,其中每一个非零元素是1 [英] Efficiently compute columnwise sum of sparse array where every non-zero element is 1
问题描述
我在的 SciPy的COM pressed稀疏行(CSR)格式。当然大多数元件是零,我还知道,所有非零元素具有1.我要计算在我的矩阵的行的不同子集和的值。目前,我做以下内容:
I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements is zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:
import numpy as np
import scipy as sp
import scipy.sparse
# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')
# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
inds = np.random.choice(data.shape[0], size=100, replace=False)
# 60% of time is spent here
extracted_rows = data[inds]
# 20% of time is spent here
row_sum = extracted_rows.sum(axis=0)
的最后几行中有一个更大的计算管线中的瓶颈。正如我在code注释的60%的时间都花在切片从随机指标的数据,20%用于计算实际的总和。
The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.
在我看来,我应该能够使用我的关于数据的知识在阵列中(即在稀疏矩阵任何非零值将是1,其他任何值present)来计算,这些资金更多高效。不幸的是,我无法弄清楚如何。只有交易data.indices
也许?我曾尝试其他稀疏结构(如CSC矩阵),以及转换为密集排列第一,但这些方法均高于该企业社会责任矩阵方法要慢。
It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices
perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
推荐答案
这是众所周知的稀疏矩阵的索引是比较慢的。还有对取得围绕通过访问数据,这样的问题直接属性。
It is well known that indexing of sparse matrices is relatively slow. And there have SO questions about getting around that by accessing the data attributes directly.
但首先一定时序。使用数据
和 IND
为你展示我得到
But first some timings. Using data
and ind
as you show I get
In [23]: datad=data.A # times at 3.76 ms per loop
In [24]: timeit row_sumd=datad[inds].sum(axis=0)
1000 loops, best of 3: 529 µs per loop
In [25]: timeit row_sum=data[inds].sum(axis=0)
1000 loops, best of 3: 890 µs per loop
In [26]: timeit d=datad[inds]
10000 loops, best of 3: 55.9 µs per loop
In [27]: timeit d=data[inds]
1000 loops, best of 3: 617 µs per loop
稀疏版本比密集1慢,但不是很多。稀疏索引慢得多,但其总和稍快。
The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.
稀疏的总和与矩阵产品做
The sparse sum is done with a matrix product
def sparse.spmatrix.sum
....
return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self
这意味着更快的方式 - 把 INDS
到1秒的适当阵列和乘
That suggests that faster way - turn inds
into an appropriate array of 1s and multiply.
In [49]: %%timeit
....: b=np.zeros((1,data.shape[0]),'int8')
....: b[:,inds]=1
....: rowmul=b*data
....:
1000 loops, best of 3: 587 µs per loop
这使得一样快相当于致密一个稀疏的操作。 (但转换为密慢得多)
That makes the sparse operation about as fast as the equivalent dense one. (but converting to dense is much slower)
==================
==================
上一次测试中缺少稀疏的
。但是时代是相似的,其结果都是一样的。 np.asmatrix
是present金额
The last time test is missing the np.asmatrix
that is present in the sparse sum
. But times are similar, and the results are the same
In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
1000 loops, best of 3: 661 µs per loop
In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
1000 loops, best of 3: 605 µs per loop
一产生一矩阵,其它阵列。但两者都在做矩阵产品, B
的第二对朦胧数据1
。尽管
是一个数组,任务实际上是委托给数据
B及矩阵产品 - 在一个不那么透明的一种方式。
One produces a matrix, the other an array. But both are doing a matrix product, 2nd dim of B
against 1st of data
. Even though b
is an array, the task is actually delegated to data
and its matrix product - in a not so transparent a way.
In [234]: x1
Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
In [235]: x2
Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
B * data.A
元素是乘法和引发错误; np.dot(B,data.A)
的作品,但速度较慢。
b*data.A
is element multiplication and raises an error; np.dot(b,data.A)
works but is slower.
较新的 numpy的/蟒蛇
有一个 MATMUL
运营商。我看到相同的时间模式:
Newer numpy/python
has a matmul
operator. I see the same time pattern:
In [280]: timeit b@dataA # dense product
100 loops, best of 3: 2.64 ms per loop
In [281]: timeit b@data.A # slower due to `.A` conversion
100 loops, best of 3: 6.44 ms per loop
In [282]: timeit b@data # sparse product
1000 loops, best of 3: 571 µs per loop
np.dot
也可授权采取行动稀疏
,但你必须要小心。我只是挂在我的机器与 np.dot(csr_matrix(B),data.A)
。
np.dot
may also delegate action to sparse
, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A)
.
这篇关于有效地计算稀疏数组的纵列和,其中每一个非零元素是1的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!