Sum over rows in scipy.sparse.csr_matrix

Question

I have a big csr_matrix and I want to sum groups of its rows to obtain a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to codes associated with these documents.)

For a minimal example, this is my matrix:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())

[[1 0 0 0 0]
 [0 0 3 0 0]
 [0 5 0 0 0]
 [4 0 0 0 0]
 [0 0 2 0 0]]

Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5), counting from 1, are combined by summing them, which would look something like this:

[[5 0 0 0 0]
 [0 5 5 0 0]]

B should again be in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack the results:

idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))

But this gives me the summed-up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices is different.

I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?

Thank you for your help

Solution

Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:

>>> S = np.array([[1, 0, 0, 1, 0], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])
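One way to see why this works: row i of S has a 1 in column k exactly when row k of A belongs to group i, so row i of the product is the sum of those rows of A. The following is a minimal check of that claim, assuming the example matrix from the question written out densely. The names A_dense, product, and explicit are illustrative and not part of the original answer. Note that the row indices are 0-based and the sum runs along axis=0, that is, down the rows.

```python
import numpy as np

# The example matrix from the question, written out densely.
A_dense = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 5, 0, 0, 0],
    [4, 0, 0, 0, 0],
    [0, 0, 2, 0, 0],
])

# One row per output group, one column per input row.
S = np.array([[1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1]])

# Summing via the indicator matrix ...
product = S @ A_dense

# ... matches summing the selected rows explicitly (0-based indices, axis=0).
explicit = np.vstack([
    A_dense[[0, 3]].sum(axis=0),
    A_dense[[1, 2, 4]].sum(axis=0),
])

assert (product == explicit).all()
print(product)
```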

The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row: row[k] is the index of the output row that input row k is added to.

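col = np.arange(5)       # one column of S per row of A
row = [0, 1, 1, 0, 1]    # output row that each row of A is added to
dat = [1, 1, 1, 1, 1]    # weight 1 for every row, i.e. a plain sum
S = csr_matrix((dat, (row, col)), shape=(2, 5))
# for csr_matrix, * is the matrix product (S @ A is equivalent)
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())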

Output:

<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
 [0 5 5 0 0]]

You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
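For the use case in the question (combining documents by an associated code), S can be built directly from an array holding one label per row. The following is a minimal sketch under that assumption. The helper name sum_rows_by_group and the example labels are hypothetical and do not come from SciPy or the original answer. It uses np.unique with return_inverse=True to turn arbitrary labels into consecutive output row indices.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sum_rows_by_group(A, labels):
    """Sum the rows of sparse matrix A that share a label (hypothetical helper).

    labels holds one entry per row of A. Returns (B, groups), where row i
    of B is the sum of the rows of A whose label equals groups[i].
    """
    labels = np.asarray(labels)
    n_rows = A.shape[0]
    if labels.shape[0] != n_rows:
        raise ValueError("need exactly one label per row of A")

    # groups: sorted unique labels; group_idx[k]: output row for input row k
    groups, group_idx = np.unique(labels, return_inverse=True)
    group_idx = group_idx.ravel()

    # Indicator matrix: S[i, k] = 1 when row k of A belongs to group i.
    S = csr_matrix(
        (np.ones(n_rows, dtype=A.dtype), (group_idx, np.arange(n_rows))),
        shape=(len(groups), n_rows),
    )
    return S @ A, groups


# Example matrix from the question.
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Illustrative document codes: rows 0 and 3 form "a", rows 1, 2, 4 form "b".
codes = ["a", "b", "b", "a", "b"]
B, groups = sum_rows_by_group(A, codes)
print(groups)
print(B.toarray())
```

The result stays sparse throughout, and groups records which label each row of B corresponds to, so the output rows can be matched back to the original codes.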
