How should I go about subsampling from a scipy.sparse.csr.csr_matrix and a list

Question

I have a scipy.sparse.csr.csr_matrix that represents the words in documents, and a list of lists in which each entry holds the categories for the corresponding row of the matrix.

My problem is that I need to randomly select N rows from the data.

So if my matrix looks like this

[1:3 2:3 4:4]
[1:5 2:5 5:4]

and my list of lists looks like this

((20,40) (80,50))  

and I need to sample 1 row, I could end up with this

[1:3 2:3 4:4]
((20,40))

I have searched the scipy documentation, but I cannot find a way to generate a new csr matrix using a list of indices.

Recommended answer

You can index a csr matrix directly with a list of row indices. First we create a matrix and look at it:

>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse Row format>

>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]

Of course, we can easily look at just the first row:

>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>

>>> print(m[0].toarray())
[[0 0 1 0]]

But we can also look at the first and third rows at once by using the list [0,2] as an index:

>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
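
As a small sketch of how list indexing behaves (using the same example matrix; the variable names `reordered` and `repeated` are only illustrative), the rows come back in the order the indices are given, and an index may appear more than once:

```python
from scipy.sparse import csr_matrix

m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])

# Rows are returned in the order the indices are listed.
reordered = m[[2, 0]]
print(reordered.toarray())

# Listing an index twice duplicates that row in the result.
repeated = m[[1, 1]]
print(repeated.toarray())
```

This is why a randomly ordered index array works as a row selector: the result simply follows the order of the sampled indices.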

Now you can generate N random row indices with no repeats (sampling without replacement) using numpy's choice:

import numpy as np
i = np.random.choice(m.shape[0], N, replace=False)  # N = number of rows to sample
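
If you need the sample to be reproducible, one option is NumPy's newer Generator API. The following is a minimal sketch, not part of the original answer: the helper name `sample_row_indices` and the seed value are assumptions made for illustration.

```python
import numpy as np


def sample_row_indices(num_rows, n, seed=None):
    """Return n distinct row indices drawn uniformly from range(num_rows)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no index is drawn twice; NumPy raises
    # ValueError if n is larger than num_rows.
    return rng.choice(num_rows, size=n, replace=False)


idx = sample_row_indices(num_rows=10, n=4, seed=0)
print(idx)
```

Passing the same seed yields the same indices on every run, which helps when the subsample has to be regenerated later.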

Then you can grab those rows from your original matrix m:

sub_m = m[i]

To grab the matching entries from your categories list of lists, first convert it to an array (this assumes every inner list has the same length); then you can index it with i:

sub_c = np.asarray(categories)[i]

If you want a list of lists back, just use:

sub_c.tolist()

or, if what you really have (or want) is a tuple of tuples, I think you have to convert it manually:

tuple(map(tuple, sub_c))
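
Putting the steps together, here is a hedged end-to-end sketch. The function name `subsample` is hypothetical, and the example matrix is built by reading the question's rows as column:value pairs, which is an assumption about that notation. It indexes the categories with a list comprehension instead of np.asarray, so it also works when the inner sequences have different lengths.

```python
import numpy as np
from scipy.sparse import csr_matrix


def subsample(matrix, categories, n, seed=None):
    """Pick n random rows of `matrix` and the matching entries of `categories`."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(matrix.shape[0], size=n, replace=False)
    sub_matrix = matrix[idx]
    # A list comprehension keeps each entry's original type and also works
    # when the inner sequences have different lengths.
    sub_categories = [categories[j] for j in idx]
    return sub_matrix, sub_categories, idx


# The question's two rows, read as column:value pairs (an assumption).
rows = [0, 0, 0, 1, 1, 1]
cols = [1, 2, 4, 1, 2, 5]
vals = [3, 3, 4, 5, 5, 4]
m = csr_matrix((vals, (rows, cols)), shape=(2, 6))
categories = [(20, 40), (80, 50)]

sub_m, sub_c, idx = subsample(m, categories, n=1, seed=0)
print(sub_m.toarray())
print(sub_c)
```

Returning idx alongside the two subsamples makes it easy to check afterwards which original rows were selected.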
