Efficient way to iterate through coo_matrix elements ordered by column?
Question
I have a scipy.sparse.coo_matrix that I want to convert to one bitset per column for further calculation. (For the purpose of the example, I'm testing on a 100K x 1M matrix.)
I'm currently doing something like this:
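from intbitset import intbitset  # third-party bitset type used in the question

# One bitset per column; each (row, col) pair adds its row index to that column's set.
bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in zip(matrix.row, matrix.col):  # itertools.izip on Python 2
    bitsets[j].add(i)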
That works, but the COO matrix iterates the values by row. Ideally, I'd like to iterate by column and build each bitset in one go instead of adding to a different bitset on every step.
I couldn't find a way to iterate over the matrix column by column. Is there one?
I don't mind converting to another sparse format, but I couldn't find an efficient way to iterate over the matrix there either. (Using nonzero() on a CSC matrix turned out to be extremely inefficient...)
Thanks!
Recommended answer
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print shows the row and col arrays, as does nonzero:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
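Because row and col are plain NumPy arrays, one option is to sort the coordinates by column directly, without changing format. The following is a minimal sketch under that assumption; the matrix is rebuilt by hand with the same sparsity pattern as above, and the names order, bounds and columns are illustrative only.

```python
import numpy as np
from scipy import sparse

# Same sparsity pattern as the example matrix above (values are irrelevant here).
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# A stable sort by column keeps the original relative order inside each column.
order = np.argsort(M.col, kind="stable")
rows_by_col = M.row[order]
cols_sorted = M.col[order]

# bounds[j]:bounds[j + 1] is the run of entries that belong to column j.
bounds = np.searchsorted(cols_sorted, np.arange(M.shape[1] + 1))
columns = [rows_by_col[bounds[j]:bounds[j + 1]] for j in range(M.shape[1])]

for j, rows in enumerate(columns):
    print(j, rows.tolist())
```

Each element of columns holds all row indices of one column, so a per-column set can be built from it in a single call rather than one add at a time.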
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col arrays in the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say that conversion to csc orders the columns, but it doesn't look like it:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
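Whatever order nonzero() reports, the csc format itself stores its entries grouped by column: indptr[j]:indptr[j + 1] delimits the slice of indices that holds the row indices of column j. A sketch of walking those slices follows; column_rows is a hypothetical helper written for this example, not part of SciPy, and the matrix is again rebuilt by hand.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

C = M.tocsc()
C.sort_indices()  # make sure row indices are ascending within each column


def column_rows(C):
    """Yield (column index, array of row indices stored in that column)."""
    for j in range(C.shape[1]):
        yield j, C.indices[C.indptr[j]:C.indptr[j + 1]]


per_column = {j: rows.tolist() for j, rows in column_rows(C)}
print(per_column)
```

Note that the slices contain every stored entry, including explicitly stored zeros if there are any. Each slice could then be handed to a set-like constructor in one call (for example intbitset(rows.tolist()), assuming that constructor accepts an iterable of integers).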
The transpose of a csr matrix produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
dtype=object)
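The rows attribute of the transposed lil matrix already has one list per column, so it can be mapped straight to per-column sets. This sketch uses Python's built-in set as a stand-in for the bitset type in the question, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

# After transposing, each lil "row" lists the original row indices of one column.
column_lists = M.T.tolil().rows

# Build every per-column set in a single call instead of repeated add().
column_sets = [set(r) for r in column_lists]
print(column_sets)
```

For a matrix of the size mentioned in the question, the lists of Python objects in lil are likely to cost more memory and time than the array-based formats, so this is mainly convenient for small inputs.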
In general, iteration over sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations make use of it indirectly (e.g. row sum). Another relatively fast set of operations are those that can work directly with the data attribute, without paying attention to row or column values.
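To illustrate the two fast paths mentioned above, here is a small sketch: a column sum written as a product with a vector of ones, and an element-wise update applied to the data array alone. The values 1.0 to 5.0 are made up for the example.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative values
C = sparse.coo_matrix((vals, (row, col)), shape=(5, 5)).tocsc()

# Column sums expressed as a sparse matrix-vector product.
col_sums = C.T @ np.ones(C.shape[0])

# Operate on the stored values only; the index arrays are left untouched.
C.data *= 2
doubled_sums = np.asarray(C.sum(axis=0)).ravel()

print(col_sums, doubled_sums)
```

Neither step loops over individual elements in Python, which is where most of the cost of naive iteration comes from.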
coo doesn't implement indexing or iteration. csr and lil implement those.