Scipy.sparse.csr_matrix: How to get top ten values and indices?


Problem description

I have a large csr_matrix and I am interested in the top ten values and their indices in each row. But I have not found a decent way to manipulate the matrix.

Here is my current solution; the main idea is to process the matrix row by row:

row = csr_matrix.getrow(row_number).toarray()[0].ravel()
top_ten_indices = row.argsort()[-10:]
top_ten_values = row[top_ten_indices]

Done this way, the advantages of the csr_matrix format are not fully used; it is more of a brute-force solution.
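For reference, a minimal self-contained version of this row-by-row approach (the toy matrix and its values are made up for illustration; the real matrix is much larger):

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy matrix made up for illustration
A = csr_matrix(np.array([[0, 5, 0, 2, 9, 0],
                         [3, 0, 7, 0, 0, 1]]))

n = 3  # top-n instead of top ten, to keep the example small
row = A.getrow(0).toarray()[0].ravel()   # densify one row
top_idx = row.argsort()[-n:]             # indices of the n largest, ascending by value
top_vals = row[top_idx]
```

For row 0 this yields columns [3, 1, 4] with values [2, 5, 9].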

Recommended answer

I don't see what the advantages of csr format are in this case. Sure, all the nonzero values are collected in one .data array, with the corresponding column indexes in .indices. But they are in blocks of varying length. And that means they can't be processed in parallel or with numpy array strides.

One solution is to pad those blocks into blocks of common length. That's what .toarray() does. Then you can find the maximum values with argsort(axis=1) or with argpartition.
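A sketch of that dense route, using argpartition along axis 1 (the matrix and variable names here are my own invention):

```python
import numpy as np
import scipy.sparse

# random sample matrix, made up for illustration
A = scipy.sparse.random(5, 100, density=0.2, format='csr', random_state=0)
n = 10
dense = A.toarray()                        # pad every row to common length
part = np.argpartition(dense, -n, axis=1)  # partial sort, cheaper than a full argsort
top_idx = part[:, -n:]                     # column indices of the n largest per row
top_vals = np.take_along_axis(dense, top_idx, axis=1)
```

Note that argpartition does not order the top n among themselves; a final sort per row is needed if the order matters.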

Another is to break them into row-sized blocks and process each of those. That's what you are doing with .getrow. Another way of breaking them up is to convert to lil format and process the sublists of the .data and .rows arrays.

A possible third option is to use the ufunc reduceat method. This lets you apply ufunc reduction methods to sequential blocks of an array. Established ufuncs like np.add take advantage of this. argsort is not such a function. But there is a way of constructing a ufunc from a Python function and gaining some modest speed over regular Python iteration. [I need to look up a recent SO question that illustrates this.]

I'll illustrate some of this with a simpler function: summing over rows.

Suppose A2 is a csr matrix:

A2.sum(axis=1)  # the fastest compiled csr method
A2.A.sum(axis=1)  # same, but with a dense intermediary
[np.sum(l.data) for l in A2]  # iterate over the rows of A2
[np.sum(A2.getrow(i).data) for i in range(A2.shape[0])]  # iterate with index
[np.sum(l) for l in A2.tolil().data]  # sum the sublists of lil format
np.add.reduceat(A2.data, A2.indptr[:-1])  # with reduceat
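A quick self-contained check that these variants agree (the toy matrix is my own; note that np.add.reduceat misbehaves on empty rows, so this sample has none):

```python
import numpy as np
from scipy.sparse import csr_matrix

A2 = csr_matrix(np.array([[0., 1., 2.],
                          [3., 0., 4.],
                          [5., 6., 0.]]))
s1 = np.asarray(A2.sum(axis=1)).ravel()              # compiled csr method
s2 = A2.toarray().sum(axis=1)                        # dense intermediary (A2.A)
s3 = np.array([np.sum(r.data) for r in A2])          # iterate over the rows
s4 = np.array([np.sum(l) for l in A2.tolil().data])  # sum the lil sublists
s5 = np.add.reduceat(A2.data, A2.indptr[:-1])        # reduceat over row blocks
assert all(np.allclose(s1, s) for s in (s2, s3, s4, s5))
```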

A2.sum(axis=1) is implemented as a matrix multiplication. That's not relevant to the sort problem, but still an interesting way of looking at the summation problem. Remember csr format was developed for efficient multiplication.
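That multiplication view can be checked directly, e.g. by multiplying with a column of ones (a small made-up matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

A2 = csr_matrix(np.array([[0., 1., 2.],
                          [3., 0., 4.]]))
ones = np.ones((A2.shape[1], 1))
prod = A2 @ ones  # sparse matrix times a dense column of ones = row sums
assert np.allclose(np.asarray(A2.sum(axis=1)), prod)
```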

For my current sample matrix (created for another SO sparse question):

<8x47752 sparse matrix of type '<class 'numpy.float32'>'
     with 32 stored elements in Compressed Sparse Row format>

Some comparison times:

In [694]: timeit np.add.reduceat(A2.data, A2.indptr[:-1])
100000 loops, best of 3: 7.41 µs per loop

In [695]: timeit A2.sum(axis=1)
10000 loops, best of 3: 71.6 µs per loop

In [696]: timeit [np.sum(l) for l in A2.tolil().data]
1000 loops, best of 3: 280 µs per loop

Everything else is 1ms or more.

I suggest focusing on developing your one-row function, something like:

def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i

Then see how it fits into one of these iteration methods. tolil() looks most promising.
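A sketch of wiring max_n into the lil iteration (the function body is adapted from above; handling rows with fewer than n nonzeros by taking the min is my own choice):

```python
import numpy as np
from scipy.sparse import csr_matrix

def max_n(row_data, row_indices, n):
    # indices of the n largest values within this row's nonzeros
    i = row_data.argsort()[-n:]
    return row_data[i], row_indices[i]

A = csr_matrix(np.array([[0, 5, 0, 2, 9, 0],
                         [3, 0, 7, 0, 0, 1]]))
Al = A.tolil()
results = [max_n(np.array(d), np.array(r), min(3, len(d)))
           for d, r in zip(Al.data, Al.rows)]
# each entry: (top values ascending, their original column indices)
```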

I haven't addressed the question of how to collect these results. Should they be lists of lists, an array with 10 columns, another sparse matrix with 10 values per row, etc.?
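One possible collection format: assemble the per-row results back into another sparse matrix holding at most n values per row (a sketch under the assumption that this representation is what's wanted; matrix and names are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[0, 5, 0, 2, 9, 0],
                         [3, 0, 7, 0, 0, 1]]))
n = 2
rows, cols, vals = [], [], []
Al = A.tolil()
for i, (d, c) in enumerate(zip(Al.data, Al.rows)):
    d, c = np.array(d), np.array(c)
    keep = d.argsort()[-n:]        # positions of the n largest in this row
    rows.extend([i] * len(keep))
    cols.extend(c[keep])
    vals.extend(d[keep])
# rebuild a sparse matrix that keeps only the top-n entries per row
topn = csr_matrix((vals, (rows, cols)), shape=A.shape)
```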

sorting each row of a large sparse & saving top K values & column index - Similar question from several years back, but unanswered.

Argmax of each row or column in scipy sparse matrix - Recent question seeking argmax for rows of csr. I discuss some of the same issues.

how to speed up loop in numpy? - example of how to use np.frompyfunc to create a ufunc. I don't know if the resulting function has the .reduceat method.

Increasing value of top k elements in sparse matrix - get the top k elements of csr (not by row). Case for argpartition.

Row summation implemented with np.frompyfunc:

In [741]: def foo(a,b):
     ...:     return a+b
In [742]: vfoo = np.frompyfunc(foo, 2, 1)
In [743]: timeit vfoo.reduceat(A2.data, A2.indptr[:-1], dtype=object).astype(float)
10000 loops, best of 3: 26.2 µs per loop

That's respectable speed. But I can't think of a way of writing a binary function (one that takes 2 arguments) that would implement argsort via reduction. So this is probably a dead end for this problem.
