Efficient way to iterate through coo_matrix elements ordered by column?


Question

I have a scipy.sparse.coo_matrix that I want to convert to one bitset per column for further calculation. (For the purposes of this example, I'm testing on a 100K x 1M matrix.)

I'm currently doing something like this:

# Python 3: the built-in zip replaces itertools.izip
bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in zip(matrix.row, matrix.col):
    bitsets[j].add(i)

That works, but the COO matrix hands back its values in row order. Ideally, I'd like to iterate by column and build each bitset in one go, instead of adding to a different bitset on every step.

I couldn't find a way to iterate over the matrix column by column. Is there one?

I don't mind converting to another sparse format, but I couldn't find an efficient way to iterate over the matrix there either. (Calling nonzero() on a CSC matrix turned out to be extremely inefficient.)

Thanks!

Recommended answer

Make a small sparse matrix:

In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [84]: print(M)
  (1, 3)    0.03079661961875302
  (0, 2)    0.722023291734881
  (0, 3)    0.547594065264775
  (1, 0)    1.1021150713641839
  (1, 2)    0.585848976928308

print, like nonzero, shows the row and col arrays:

In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))

Converting to csr orders the rows (but not necessarily the columns within each row). nonzero converts back to coo and returns row and col in the new order.

In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))

I was going to say that conversion to csc orders the columns, but the nonzero output doesn't show that:

In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
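
Although nonzero() on the csc result comes back in row order, the csc format itself stores its entries column by column. A minimal sketch, assuming a matrix with the same sparsity pattern as the session above (the values are placeholders and the name rows_per_column is made up for illustration), reads each column's row indices straight from the indices and indptr attributes:

```python
import numpy as np
from scipy import sparse

# Same sparsity pattern as the session above; the values are placeholders.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

C = M.tocsc()
C.sort_indices()  # ensure row indices are ascending within each column

# Column j's row indices are C.indices[C.indptr[j]:C.indptr[j + 1]].
rows_per_column = [
    C.indices[C.indptr[j]:C.indptr[j + 1]].tolist()
    for j in range(C.shape[1])
]
print(rows_per_column)
```

Each slice could presumably be handed to a bitset constructor in a single call, which is the "build the bitset at once" behaviour the question asks for. Note that explicitly stored zeros, if any, would also appear in these slices.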

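The transpose of a csr is a csc, and its nonzero output is ordered by the original columns (the first array below holds the original column indices, the second the original rows):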

In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
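
A column-ordered listing can also be produced without leaving the coo format, by sorting the row and col arrays directly. The following is only a sketch under the assumption that the matrix has no duplicate entries; the variable names are illustrative, not part of the original answer:

```python
import numpy as np
from scipy import sparse

# Same sparsity pattern as the session above; the values are placeholders.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# lexsort treats the LAST key as primary: sort by column, then by row.
order = np.lexsort((M.row, M.col))
sorted_rows = M.row[order]
sorted_cols = M.col[order]

# Count entries per column, then cut the sorted rows at the column boundaries.
counts = np.bincount(sorted_cols, minlength=M.shape[1])
groups = np.split(sorted_rows, np.cumsum(counts)[:-1])
rows_per_column = [g.tolist() for g in groups]
print(rows_per_column)
```

Whether this beats a tocsc() conversion on a 100K x 1M matrix is something to measure rather than assume; both do a comparable amount of sorting work.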

I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:

In [90]: M.tolil().rows
Out[90]: 
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
      dtype=object)
In [91]: M.tolil().T.rows
Out[91]: 
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
      dtype=object)
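
As a rough illustration of how the transposed lil rows map onto the question, each list can be turned into one per-column collection in a single step. This sketch uses Python's built-in set as a stand-in for the bitset type, which is an assumption made to keep the example dependency-free:

```python
import numpy as np
from scipy import sparse

# Same sparsity pattern as the session above; the values are placeholders.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# After transposing, each lil "row" lists the original row indices of one column.
column_lists = M.tolil().T.rows

# Build one set per column in a single call each (stand-in for a bitset).
column_sets = [set(r) for r in column_lists]
print(column_sets)
```

The lil conversion builds a Python list per row, so on very large matrices it may cost more memory than working with the compressed formats.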

In general, iteration over sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations use it indirectly (e.g. row sum). Another relatively fast set of operations is those that work directly on the data attribute, without paying attention to row or column values.
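
To make the two fast paths concrete, here is a small sketch with made-up values: a row sum written as a product with a vector of ones, and an elementwise operation applied to the data attribute alone. It illustrates the idea and is not a benchmark:

```python
import numpy as np
from scipy import sparse

# Illustrative values; the pattern matches the session above.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5)).tocsr()

# Row sums expressed as a sparse matrix times a dense vector of ones.
row_sums = M @ np.ones(M.shape[1])

# Elementwise work that ignores positions can act on .data directly.
squared = M.copy()
squared.data **= 2

print(row_sums)
print(squared.sum())
```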

coo doesn't implement indexing or iteration; csr and lil do.
