从scipy稀疏矩阵的每一行中有效地选择随机非零列 [英] Efficiently select random non-zero column from each row of sparse matrix in scipy

查看:116
本文介绍了从scipy稀疏矩阵的每一行中有效地选择随机非零列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试为大型稀疏SciPy矩阵的每一行有效地选择一个随机的非零列索引.我似乎无法弄清楚这样做的矢量化方式,所以我求助于一个非常慢的Python循环:

I'm trying to efficiently select a random non-zero column index for each row of a large sparse SciPy matrix. I can't seem to figure out a vectorized way of doing it, so I'm resorting to a very slow Python loop:

random_columns = np.zeros((sparse_matrix.shape[0]))
for i,row in enumerate(sparse_matrix):
    random_columns[i] = (np.random.choice(row.nonzero()[1]))

我的矩阵大约是(4000000,800)csr_matrix,几乎每一行都只有一个非零值,因此Python循环会降低性能.必须有更好的方法!

My matrix is an approximately (4000000, 800) csr_matrix with almost every row having only one non-zero value, so the Python loop is killing performance. There has to be a better way!

编辑,我可以通过直接访问csr_matrix的基础数据使速度快2倍:

EDIT I can make it about 2x faster by directly accessing the underlying data of the csr_matrix:

random_columns[i] = row.indices[np.random.choice(len(row.data))]

推荐答案

您是否查看了此格式以及其他稀疏格式的基础数据表示形式?

Have you looked at the underlying data representation for this, and other sparse formats?

例如,对于小矩阵

In [257]: M = sparse.rand(10,10,.1,format='csr')

In [258]: M
Out[258]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [259]: M.data
Out[259]: 
array([ 0.86390256,  0.85244302,  0.88549326,  0.78737361,  0.99918561,
        0.89862529,  0.86842524,  0.25714778,  0.4174032 ,  0.33137501])

In [260]: M.indices
Out[260]: array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32)

In [261]: M.indptr
Out[261]: array([ 0,  1,  1,  3,  4,  4,  5,  5,  7,  8, 10], dtype=int32)

对于csr,索引有些模糊.确切地说,每个非零值的列索引都出现在M.indices中,但是要花点时间才能确定哪些索引属于哪一行.

For csr the indices are a bit obscure. Or rather, the column index for each nonzero value is present in M.indices, but it takes a bit of calculation to determine which ones belong to which row.

对于其他格式,连接更加明显.

For other formats the connection is more obvious.

对于lil,我们有2个列表列表

For lil we have 2 lists of lists

In [262]: Ml=M.tolil()

In [263]: Ml.data
Out[263]: 
array([[0.863902562935336], [], [0.8524430195076207, 0.8854932609233054],
       [0.7873736126927198], [], [0.9991856090158101], [],
       [0.8986252926235274, 0.8684252408594123], [0.2571477751356357],
       [0.4174032029993796, 0.3313750148434619]], dtype=object)

In [264]: Ml.rows
Out[264]: array([[1], [], [5, 8], [8], [], [9], [], [0, 3], [9], [4, 5]], dtype=object)

In [265]: [np.random.choice(x) for x in Ml.rows if x]
# some rows might not have any nonzero
Out[265]: [1, 5, 8, 9, 3, 9, 5]

In [268]: [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
Out[268]: [1, 5, 8, 9, 0, 9, 4]

您也可以将nonzero用于整个矩阵

You can also take nonzero for the whole matrix

 In [274]: M.nonzero()
 Out[274]: 
 (array([0, 2, 2, 3, 5, 7, 7, 8, 9, 9], dtype=int32),
 array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32))

这些是与M.tocoo()和查看rowcol属性的数组相同的数组.从理论上讲,您可以使用groupby获取列的子列表,然后从中进行选择.但同样,您有列表或生成器以及迭代.

These are the same arrays that you would get with M.tocoo() and looking at the row and col attributes. In theory you could use groupby to get sublists of columns, and pick from those. But again you have lists or generators and iteration.

我不知道这些表示形式中的任何一个是否更快.

I don't know if anyone of these representations is faster or not.

将问题向量化可能有一些限制.非零的数目(choices的输入)将逐行变化.有些行包含非,其他行包含1个或更多.每当您遇到长度不同的数组或列表时,都很难对操作进行矢量化处理.如果无法将值排列到常规2d数组中,则无法通过数组操作将它们作为一个整体进行操作.

There may be some limits to vectorizing the problem. The number of nonzeros (the inputs to choices) will differ by row. Some rows have non, others 1 or more. Anytime you encounter arrays or lists of different lengths, it is difficult to vectorize the operation. If you can't arrange the values into a regular 2d array you can't operate on them as a whole with array operations.

lil格式值得一看:

In [276]: timeit [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
100 loops, best of 3: 4.24 ms per loop

In [289]: timeit [np.random.choice(row.indices) for row in M if len(row.indices)]
1000 loops, best of 3: 1.52 ms per loop
# 3x speedup using row.indices

In [277]: %%timeit
   .....: Ml=M.tolil()
   .....: [np.random.choice(x) for x in Ml.rows if x]
   .....: 
10000 loops, best of 3: 181 µs per loop

这篇关于从scipy稀疏矩阵的每一行中有效地选择随机非零列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆