有效地计算稀疏数组的纵列和,其中每一个非零元素是1 [英] Efficiently compute columnwise sum of sparse array where every non-zero element is 1

查看:190
本文介绍了有效地计算稀疏数组的纵列和,其中每一个非零元素是1的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在的 SciPy的COM pressed稀疏行(CSR)格式。当然大多数元件是零,我还知道,所有非零元素具有1.我要计算在我的矩阵的行的不同子集和的值。目前,我做以下内容:

I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements is zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:

import numpy as np
import scipy as sp
import scipy.sparse

# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')

# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
    inds = np.random.choice(data.shape[0], size=100, replace=False)

    # 60% of time is spent here
    extracted_rows = data[inds]

    # 20% of time is spent here
    row_sum = extracted_rows.sum(axis=0)

的最后几行中有一个更大的计算管线中的瓶颈。正如我在code注释的60%的时间都花在切片从随机指标的数据,20%用于计算实际的总和。

The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.

在我看来,我应该能够使用我的关于数据的知识在阵列中(即在稀疏矩阵任何非零值将是1,其他任何值present)来计算,这些资金更多高效。不幸的是,我无法弄清楚如何。只有交易data.indices 也许?我曾尝试其他稀疏结构(如CSC矩阵),以及转换为密集排列第一,但这些方法均高于该企业社会责任矩阵方法要慢。

It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.

推荐答案

这是众所周知的稀疏矩阵的索引是比较慢的。还有对取得围绕通过访问数据,这样的问题直接属性。

It is well known that indexing of sparse matrices is relatively slow. And there have SO questions about getting around that by accessing the data attributes directly.

但首先一定时序。使用数据 IND 为你展示我得到

But first some timings. Using data and ind as you show I get

In [23]: datad=data.A  # times at 3.76 ms per loop

In [24]: timeit row_sumd=datad[inds].sum(axis=0)
1000 loops, best of 3: 529 µs per loop

In [25]: timeit row_sum=data[inds].sum(axis=0)
1000 loops, best of 3: 890 µs per loop

In [26]: timeit d=datad[inds]
10000 loops, best of 3: 55.9 µs per loop

In [27]: timeit d=data[inds]
1000 loops, best of 3: 617 µs per loop

稀疏版本比密集1慢,但不是很多。稀疏索引慢得多,但其总和稍快。

The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.

稀疏的总和与矩阵产品做

The sparse sum is done with a matrix product

def sparse.spmatrix.sum
     ....
    return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self

这意味着更快的方式 - 把 INDS 到1秒的适当阵列和乘

That suggests that faster way - turn inds into an appropriate array of 1s and multiply.

In [49]: %%timeit
   ....: b=np.zeros((1,data.shape[0]),'int8')
   ....: b[:,inds]=1
   ....: rowmul=b*data
   ....: 
1000 loops, best of 3: 587 µs per loop

这使得一样快相当于致密一个稀疏的操作。 (但转换为密慢得多)

That makes the sparse operation about as fast as the equivalent dense one. (but converting to dense is much slower)

==================

==================

上一次测试中缺少稀疏 np.asmatrix 是present金额。但是时代是相似的,其结果都是一样的。

The last time test is missing the np.asmatrix that is present in the sparse sum. But times are similar, and the results are the same

In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
1000 loops, best of 3: 661 µs per loop

In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
1000 loops, best of 3: 605 µs per loop

一产生一矩阵,其它阵列。但两者都在做矩阵产品, B 的第二对朦胧数据1 。尽管 是一个数组,任务实际上是委托给数据 B及矩阵产品 - 在一个不那么透明的一种方式。

One produces a matrix, the other an array. But both are doing a matrix product, 2nd dim of B against 1st of data. Even though b is an array, the task is actually delegated to data and its matrix product - in a not so transparent a way.

In [234]: x1
Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)

In [235]: x2
Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)

B * data.A 元素是乘法和引发错误; np.dot(B,data.A)的作品,但速度较慢。

b*data.A is element multiplication and raises an error; np.dot(b,data.A) works but is slower.

较新的 numpy的/蟒蛇有一个 MATMUL 运营商。我看到相同的时间模式:

Newer numpy/python has a matmul operator. I see the same time pattern:

In [280]: timeit b@dataA           # dense product
100 loops, best of 3: 2.64 ms per loop

In [281]: timeit b@data.A           # slower due to `.A` conversion
100 loops, best of 3: 6.44 ms per loop

In [282]: timeit b@data             # sparse product
1000 loops, best of 3: 571 µs per loop

np.dot 也可授权采取行动稀疏,但你必须要小心。我只是挂在我的机器与 np.dot(csr_matrix(B),data.A)

np.dot may also delegate action to sparse, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A).

这篇关于有效地计算稀疏数组的纵列和,其中每一个非零元素是1的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆