How should I go about subsampling from a scipy.sparse.csr.csr_matrix and a list
Question
I have a scipy.sparse.csr.csr_matrix that represents words in a document, and a list of lists where each entry holds the categories for the corresponding row of the matrix.
The problem is that I need to randomly select N rows from the data.
So if my matrix looks like this
[1:3 2:3 4:4]
[1:5 2:5 5:4]
and my list of lists looks like this
((20,40) (80,50))
and I need to sample 1 row, I could end up with this
[1:3 2:3 4:4]
((20,40))
I have searched the scipy documentation, but I cannot find a way to generate a new csr matrix using a list of indices.
Recommended answer
You can simply index a csr matrix with a list of indices. First we create a matrix and look at it:
>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]
Of course, we can easily look at just the first row:
>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
>>> print(m[0].toarray())
[[0 0 1 0]]
But we can also look at the first and third rows at once using the list [0,2] as an index:
>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
Now you can generate N random indices with no repeats (sampling without replacement) using numpy's choice, with numpy imported as np:
i = np.random.choice(np.arange(m.shape[0]), N, replace=False)
Then you can grab those rows from your original matrix m:
sub_m = m[i]
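Put together, the sampling and indexing steps might look like the sketch below. It reuses the example matrix from above; the value N = 2, the seed 0, and the use of np.random.default_rng in place of np.random.choice are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Example matrix from the answer above.
m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])
N = 2  # number of rows to sample (assumed); must not exceed m.shape[0]

# Seeded generator so the draw is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# N distinct row indices, drawn without replacement.
i = rng.choice(m.shape[0], size=N, replace=False)

# Indexing with the index array returns a new sparse matrix holding those rows.
sub_m = m[i]
```

The result stays sparse, so nothing is densified unless you call toarray() yourself.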
To grab them from your categories list of lists, you must first make it an array; then you can index it with the index array i:
sub_c = np.asarray(categories)[i]
If you want a list of lists back, just use:
sub_c.tolist()
or, if what you really have/want is a tuple of tuples, I think you have to do it manually:
tuple(map(tuple, sub_c))
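One caveat: np.asarray(categories) needs every inner list to have the same length, and recent NumPy versions reject ragged nested lists. If your rows can carry different numbers of categories, a plain list comprehension keeps the labels aligned with the sampled rows without any conversion. The helper name subsample_rows, its parameters, and the ragged example data below are hypothetical, written only to sketch the idea.

```python
import numpy as np
from scipy.sparse import csr_matrix


def subsample_rows(matrix, labels, n, seed=None):
    """Hypothetical helper: pick n random rows and their matching labels."""
    rng = np.random.default_rng(seed)
    # n distinct row indices, drawn without replacement.
    idx = rng.choice(matrix.shape[0], size=n, replace=False)
    sub_matrix = matrix[idx]
    # A list comprehension avoids np.asarray, so inner lists may differ in length.
    sub_labels = [labels[k] for k in idx]
    return sub_matrix, sub_labels, idx


m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])
categories = [[20, 40], [80], [10, 30, 50]]  # made-up ragged example data
sub_m, sub_c, idx = subsample_rows(m, categories, n=2, seed=0)
```

Returning idx alongside the samples makes it easy to check afterwards which original rows were picked.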