Finding the top n values in a row of a scipy sparse matrix
Problem description
I have a scipy sparse matrix in CSR format. It is 72665x72665, so converting it to a dense matrix to operate on it is impractical (the dense representation would take roughly 40 GB). The matrix is symmetric and has about 82 million non-zero entries (~1.5%).
For each row, I want to get the indices of the N largest values. If this were a numpy array, I would use np.argpartition like so:
for row in matrix:
    top_n_idx = np.argpartition(row, -n)[-n:]
Is there something similar I can do for a sparse matrix?
Recommended answer
This improves on @Paul Panzer's solution: it now also handles rows that have fewer than n stored values.
import numpy as np

def top_n_idx_sparse(matrix, n):
    '''Return index of top n values in each row of a sparse matrix'''
    top_n_idx = []
    # indptr[i]:indptr[i+1] delimits row i's slice of data/indices
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        # a row may store fewer than n values
        n_row_pick = min(n, ri - le)
        top_n_idx.append(matrix.indices[le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]])
    return top_n_idx
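As a quick sanity check, here is a minimal sketch that runs the function on a small hand-made CSR matrix. The 4x5 matrix, the choice n = 2, and the explicit guard for empty rows are my own illustrative assumptions, not part of the original question or answer. Note that only explicitly stored entries are ranked: a row with fewer than n stored values returns fewer than n indices, and implicit zeros are never candidates (which matters if a row stores negative values).

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_n_idx_sparse(matrix, n):
    """Return, per row, the column indices of the n largest stored values."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        if n_row_pick == 0:
            # Assumption: treat an empty row (or n == 0) explicitly
            # instead of relying on argpartition with an empty input.
            top_n_idx.append(np.array([], dtype=matrix.indices.dtype))
            continue
        picks = np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
        top_n_idx.append(matrix.indices[le + picks])
    return top_n_idx


# Hypothetical example: row 2 stores one value, row 3 stores none.
dense = np.array([
    [0.0, 5.0, 1.0, 0.0, 3.0],
    [2.0, 0.0, 0.0, 7.0, 4.0],
    [0.0, 0.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
matrix = csr_matrix(dense)

result = top_n_idx_sparse(matrix, 2)
# argpartition does not order the picked entries, so sort for display.
result_sorted = [sorted(idx.tolist()) for idx in result]
print(result_sorted)  # [[1, 4], [3, 4], [2], []]
```

The result is a list of arrays rather than a 2-D array because rows can yield different numbers of indices. If you need the picks ordered by value, sort each row's selection afterwards, since np.argpartition only guarantees which entries are selected, not their order.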