Group by sparse matrix in scipy and return a matrix


Question

There are a few questions on SO about using groupby with sparse matrices. However, the outputs seem to be lists, dictionaries, dataframes, and other objects.

I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.

Here is the context. I have vectorized some documents (sample data here):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)

print("Dimensions of training set: {0}".format(train_X.shape))
print(type(train_X))

Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>

From the original dataframe I use the date, in day-of-the-year format, to create the groups I would like to sum over:

from scipy import sparse

df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))

print("Dimensions of concatenated set: {0}".format(train_X_all.shape))

Dimensions of concatenated set: (8, 181)

Now I'd like to apply groupby (or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.

The output matrix would be 3 x 181 and look something like this:

 1, 1, 1, ..., 2, 1, 3
 2, 1, 3, ..., 1, 1, 4
 0, 0, 0, ..., 1, 2, 5

Here columns 1 to 180 represent the tokens and column 181 represents the day of the year.

Recommended answer

The best way to calculate the sum of selected columns (or rows) of a csr sparse matrix is a matrix product with another sparse matrix that has 1's where you want to sum. In fact, csr sum (for a whole row or column) works by matrix product, and indexing rows (or columns) is also done with a product (https://stackoverflow.com/a/39500986/901925).

So I'd group the dates array and use that information to construct the summing 'mask'.
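As a minimal sketch of the idea, summing a chosen set of rows can be written as a product with a one-row sparse selector. The matrix values and the names `rows_to_sum` and `selector` are made up for illustration and are not part of the original answer.

```python
import numpy as np
from scipy import sparse

# Small example matrix (values chosen only for illustration).
M = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5],
                                [0, 6, 0]]))

# A 1 x 4 selector with 1's at the positions of the rows to add up.
rows_to_sum = [0, 2]
selector = sparse.csr_matrix(
    (np.ones(len(rows_to_sum), dtype=int),
     (np.zeros(len(rows_to_sum), dtype=int), rows_to_sum)),
    shape=(1, M.shape[0]))

# The product adds rows 0 and 2 of M and stays sparse.
summed = selector @ M
print(summed.toarray())  # [[5 0 7]]
```

Stacking one such selector row per group gives the masking matrix discussed further down.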

For the sake of discussion, consider this dense array:

In [117]: A
Out[117]: 
array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
       [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
       [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
       [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
       [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
       [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
       [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])

Make a sparse copy:

In [118]: M=sparse.csr_matrix(A)

Generate some groups based on the last column; collections.defaultdict (from collections import defaultdict) is a convenient tool for this:

In [119]: grps=defaultdict(list)
In [120]: for i,v in enumerate(A[:,-1]):
     ...:     grps[v].append(i)

In [121]: grps
Out[121]: defaultdict(list, {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]})

I can iterate over those groups, collect the matching rows of M, sum across those rows, and produce:

In [122]: {k:M[v,:].sum(axis=0) for k, v in grps.items()}
Out[122]: 
{0: matrix([[0, 0, 4, 8, 7, 2, 7, 6, 8, 0]], dtype=int32),
 2: matrix([[3, 0, 1, 0, 0, 0, 0, 0, 0, 2]], dtype=int32),
 4: matrix([[4, 7, 6, 8, 1, 5, 9, 0, 3, 8]], dtype=int32),
 7: matrix([[ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21]], dtype=int32)}

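The last column holds the group label itself, so its sums are the label times the group size: 2*4 = 8 for the two rows labelled 4, and 3*7 = 21 for the three rows labelled 7.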

So there are 2 tasks. The first is collecting the groups, whether with this defaultdict, with itertools.groupby (which in this case would require sorting), or with pandas groupby. The second is collecting the rows of each group and summing them. This dictionary iteration is conceptually simple.
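For the first task, the pandas route might look like the sketch below. It assumes the group labels are available as a 1-D array; here `labels` is just the last column of the example array A, typed in by hand. The `indices` attribute of a pandas groupby object maps each label to the positions of its rows.

```python
import numpy as np
import pandas as pd

# Last column of the example array A, used as the group labels.
labels = np.array([7, 0, 0, 4, 0, 4, 7, 7, 2, 0])

# Group the row positions by label; .indices gives {label: positions}.
indices = pd.Series(labels).groupby(labels).indices

# Convert to plain Python ints and lists for readability.
grps = {int(k): v.tolist() for k, v in indices.items()}
print(grps)
```

This should give the same index lists as the defaultdict loop, and each list can be used to index rows of M in the same way.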

A masking matrix might work like this:

In [141]: mask=np.zeros((10,10),int)
In [142]: for i,v in enumerate(A[:,-1]): # same sort of iteration
     ...:     mask[v,i]=1
     ...:     
In [143]: Mask=sparse.csr_matrix(mask)
...
In [145]: Mask.A
Out[145]: 
array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       ....
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [146]: (Mask*M).A
Out[146]: 
array([[ 0,  0,  4,  8,  7,  2,  7,  6,  8,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  0,  1,  0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 4,  7,  6,  8,  1,  5,  9,  0,  3,  8],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

This Mask*M has the same values as the dictionary rows, but with extra all-zero rows for the labels that never occur. I can isolate the nonzero values with the lil format:

In [147]: (Mask*M).tolil().data
Out[147]: 
array([[4, 8, 7, 2, 7, 6, 8], [], [3, 1, 2], [],
       [4, 7, 6, 8, 1, 5, 9, 3, 8], [], [],
       [9, 2, 15, 10, 2, 7, 2, 8, 13, 21], [], []], dtype=object)
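If the goal is to stay in a sparse matrix rather than look at the lil data lists, one possible alternative is to keep only the rows that have stored entries. This is a sketch on a small stand-in for Mask*M; the values and the names `product`, `nonempty`, and `compact` are assumptions for illustration. Note that a group whose rows are all zero would also be dropped by this filter.

```python
import numpy as np
from scipy import sparse

# Stand-in for Mask*M: a sparse result with some all-zero rows.
product = sparse.csr_matrix(np.array([[0, 4, 8],
                                      [0, 0, 0],
                                      [3, 0, 1],
                                      [0, 0, 0]]))

# getnnz(axis=1) counts stored entries per row; keep rows with at least one.
nonempty = np.flatnonzero(product.getnnz(axis=1))

# Row indexing on a csr matrix returns another sparse matrix.
compact = product[nonempty]
print(nonempty)           # [0 2]
print(compact.toarray())  # [[0 4 8]
                          #  [3 0 1]]
```

The array `nonempty` doubles as the list of labels that actually occur, since row i of the product corresponds to label i.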

I can construct the Mask matrix directly using the coo style of sparse input:

Mask = sparse.csr_matrix((np.ones(A.shape[0],int),
    (A[:,-1], np.arange(A.shape[0]))), shape=(A.shape))

That should be faster and should avoid the memory error, since it needs neither a Python loop nor a large dense array.
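The mask above has one row per possible label value, which works in the example because A is square and its labels are below 10. For day-of-year labels, one way to get exactly one output row per group is to compact the labels with np.unique first. The helper name `group_sum` and the sample data are hypothetical, not part of the original answer.

```python
import numpy as np
from scipy import sparse

def group_sum(X, labels):
    """Sum the rows of sparse X that share a label.

    Returns (unique_labels, sums); row i of sums belongs to unique_labels[i].
    """
    labels = np.asarray(labels).ravel()
    # inverse[i] is the group number (0..n_groups-1) of row i.
    uniq, inverse = np.unique(labels, return_inverse=True)
    n_rows = X.shape[0]
    # Indicator matrix: entry (g, i) is 1 when row i belongs to group g.
    indicator = sparse.csr_matrix(
        (np.ones(n_rows, dtype=X.dtype), (inverse, np.arange(n_rows))),
        shape=(len(uniq), n_rows))
    return uniq, indicator @ X

# Made-up data: 5 documents, 4 token columns, 3 distinct days.
X = sparse.csr_matrix(np.array([[1, 0, 0, 2],
                                [0, 1, 0, 0],
                                [3, 0, 0, 0],
                                [0, 0, 1, 1],
                                [0, 2, 0, 0]]))
days = np.array([32, 45, 32, 60, 45])

uniq_days, sums = group_sum(X, days)
print(uniq_days)       # [32 45 60]
print(sums.toarray())  # [[4 0 0 2]
                       #  [0 3 0 0]
                       #  [0 0 1 1]]
```

With this layout the day labels are returned separately, so they do not need to be stored as an extra matrix column before summing. If a day column is still wanted in the result, it could be appended afterwards with sparse.hstack, as in the question.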
