Group by on scipy sparse matrix

Problem description

I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix following these keys and aggregate with a sum function.

Example:

Keys:
['foo','bar','foo','baz','baz','bar']

Sparse matrix:
(0,1) 3              -> corresponds to the first 'foo' key
(0,10) 4             -> corresponds to the first 'bar' key
(2,1) 1              -> corresponds to the second 'foo' key
(1,3) 2              -> corresponds to the first 'baz' key
(2,3) 10             -> corresponds to the second 'baz' key
(2,4) 1              -> corresponds to the second 'bar' key

Expected result:
{
    'foo': {1: 4},               -> 4 = 3 + 1
    'bar': {4: 1, 10: 4},        
    'baz': {3: 12}               -> 12 = 2 + 10
}

What is the most efficient way to do this?

I already tried using pandas.SparseSeries.from_coo on my sparse matrix so that I could use the pandas group by, but I hit this known bug:

site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    863         for obj in objs:
    864             if not isinstance(obj, NDFrame):
--> 865                 raise TypeError("cannot concatenate a non-NDFrame object")
    866 
    867             # consolidate

TypeError: cannot concatenate a non-NDFrame object

Solution

I can generate your target with basic dictionary and list operations:

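keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
dd = {}
for i, k in enumerate(keys):
    d1 = dd.get(k, {})           # per-key dict: column -> running sum
    v = d1.get(cols[i], 0)       # current sum for this column (0 if unseen)
    d1[cols[i]] = v + data[i]
    dd[k] = d1
print(dd)

producing (on Python 3.7+, where dicts preserve insertion order)

{'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}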
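
For larger inputs, the same per-element grouping can probably be pushed into scipy instead of a Python loop. The following is only a sketch, assuming (as in the loop above) that there is one key per stored element. It relies on the documented behaviour that converting a COO matrix to CSR sums duplicate entries. The names `groups`, `group_idx` and `S` are illustrative and not part of the original answer.

```python
import numpy as np
from scipy import sparse

keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]

# Map every key to an integer group id (groups come back sorted)
groups, group_idx = np.unique(keys, return_inverse=True)

# Use the group id as the row index; COO -> CSR sums entries that
# share the same (group, column) position
S = sparse.coo_matrix((data, (group_idx, cols))).tocsr()

# Optional: unpack the grouped matrix into the nested-dict form
C = S.tocoo()
result = {str(name): {} for name in groups}
for g, c, v in zip(C.row, C.col, C.data):
    result[str(groups[g])][int(c)] = int(v)

print(result)
```

The matrix `S` (one row per distinct key) is itself the grouped result; the final loop is only there to compare against the dictionary shown above.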

I can generate a sparse matrix from this data as well with:

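import numpy as np
from scipy import sparse
# Build a COO matrix from the (data, (rows, cols)) triplets defined above
M = sparse.coo_matrix((data, (rows, cols)))
print(M)
print()
# Convert to dictionary-of-keys format; the element order may change
Md = M.todok()
print(Md)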

But notice that the order of the terms is not fixed. In the coo format the order is as entered, but converting to another format changes it. In other words, the match between the keys and the elements of the sparse matrix is unspecified.

  (0, 1)    3
  (0, 10)   4
  (2, 1)    1
  (1, 3)    2
  (2, 3)    10
  (2, 4)    1

  (0, 1)    3
  (1, 3)    2
  (2, 1)    1
  (2, 3)    10
  (0, 10)   4
  (2, 4)    1
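
To make the ordering issue concrete, here is a small sketch (not from the original answer) that compares the element order of the COO matrix with the order after a round trip through CSR. It assumes the same `rows`, `cols` and `data` lists as above; the names `coo_triples` and `csr_triples` are illustrative only.

```python
from scipy import sparse

rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]

M = sparse.coo_matrix((data, (rows, cols)))

# In COO format, row/col/data keep the order in which they were entered
coo_triples = list(zip(M.row.tolist(), M.col.tolist(), M.data.tolist()))

# After converting to CSR (and back to COO to read the triples),
# the elements come out sorted by row, then by column
C = M.tocsr().tocoo()
csr_triples = list(zip(C.row.tolist(), C.col.tolist(), C.data.tolist()))

print(coo_triples)
print(csr_triples)
```

A list of keys that lines up with `coo_triples` therefore no longer lines up with `csr_triples`, which is why the keys need a well-defined mapping to either rows or (row, column) positions.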

Until you clear up this mapping, the initial dictionary approach is best.
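
If the mapping turns out to be one key per matrix row, as the first paragraph of the question suggests, one common approach is to multiply by a sparse indicator matrix. The following is a minimal sketch under that assumption; the small matrix and the `row_keys` array are made-up illustration data, and `G` is a hypothetical name for the indicator matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: one key per matrix row
row_keys = np.array(['foo', 'bar', 'foo'])
M = sparse.csr_matrix(np.array([[0, 3, 0, 0],
                                [0, 0, 2, 0],
                                [5, 1, 0, 0]]))

# Integer group id for each row (groups come back sorted)
groups, group_idx = np.unique(row_keys, return_inverse=True)
n_groups, n_rows = len(groups), M.shape[0]

# Indicator matrix: G[g, r] == 1 when row r belongs to group g
G = sparse.csr_matrix(
    (np.ones(n_rows, dtype=M.dtype), (group_idx, np.arange(n_rows))),
    shape=(n_groups, n_rows),
)

# Sparse product sums the rows of M within each group
summed = G @ M

print(groups)
print(summed.toarray())
```

Row `g` of `summed` holds the column-wise sums for `groups[g]`, and the result stays sparse, so no dense intermediate of the full matrix is built.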
