How to select some rows from a sparse matrix and then use them to form a new sparse matrix


Question

I have a very large sparse matrix (100000 columns and 100000 rows). I want to select some of the rows of this sparse matrix and then use them to form a new sparse matrix. I tried to do this by first converting the rows to a dense matrix and then converting them back to a sparse matrix, but when I do this Python raises a 'Memory error'. Then I tried another method: I select the rows of the sparse matrix and put them into an array. But when I try to convert this array to a sparse matrix, it says: 'ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().' So how can I transform this list of sparse matrices into a single big sparse matrix?

# X_train is a 100000x100000 matrix stored in sparse form
# y_train is a 1-dimensional array of length 100000
# I try to build a new sparse matrix from some rows of X_train; the
# selection criterion is that the sum of the sparse row equals 0

y_train_new = []
X_train_new = []
for i in range(len(y_train)):
    if np.sum(X_train[i].toarray()[0]) == 0:
        X_train_new.append(X_train[i])
        y_train_new.append(y_train[i])

When I do this:

X_train_new = scipy.sparse.csr_matrix(X_train_new)

I get the error message:

'ValueError: The truth value of an array with more than one element is 
ambiguous. Use a.any() or a.all().'

Recommended answer

I added some tags that would have helped me see your question sooner.

When asking about an error, it's a good idea to provide some or all of the traceback, so we can see where the error is occurring. Information on the inputs to the problem function call can also help.

Fortunately I can recreate the problem fairly easily, and in a reasonably sized example. No need to make a 100000 x 100000 matrix that no one can look at!

Make a modest-size sparse matrix:

In [126]: M = sparse.random(10,10,.1,'csr')                                                              
In [127]: M                                                                                              
Out[127]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

I can do a whole matrix row sum, just as with a dense array. The sparse code actually uses matrix-vector multiplication to do this, producing a dense matrix.

In [128]: M.sum(axis=1)                                                                                  
Out[128]: 
matrix([[0.59659958],
        [0.80390719],
        [0.37251645],
        [0.        ],
        [0.85766909],
        [0.42267366],
        [0.76794737],
        [0.        ],
        [0.83131054],
        [0.46254634]])

It's sparse enough that some rows have no nonzero values. With floats, especially in the 0-1 range, I'm not going to get rows where the nonzero values cancel out.

Or using a row-by-row calculation:

In [133]: alist = [np.sum(row.toarray()[0]) for row in M]                                                
In [134]: alist                                                                                          
Out[134]: 
[0.5965995802776853,
 0.8039071870427961,
 0.37251644566924424,
 0.0,
 0.8576690924353791,
 0.42267365715276595,
 0.7679473651419432,
 0.0,
 0.8313105376003095,
 0.4625463360625408]

And selecting the rows that do sum to zero (in this case empty ones):

In [135]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [136]: alist                                                                                          
Out[136]: 
[<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>]

Note that this is a list of sparse matrices. That's what you got too, right?

Now if I try to make a matrix from that, I get your error:

In [137]: sparse.csr_matrix(alist)                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-137-5e20e6fc2524> in <module>
----> 1 sparse.csr_matrix(alist)

/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     86                                  "".format(self.format))
     87             from .coo import coo_matrix
---> 88             self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
     89 
     90         # Read matrix dimensions given, if any

/usr/local/lib/python3.6/dist-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189                                          (shape, self._shape))
    190 
--> 191                 self.row, self.col = M.nonzero()
    192                 self.data = M[self.row, self.col]
    193                 self.has_canonical_format = True

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

OK, this error doesn't tell me a whole lot (at least without more reading of the code), but it's clearly having problems with the input list. But read csr_matrix docs again! Does it say we can give it a list of sparse matrices?

But there is a sparse.vstack function that will work with a list of matrices (modeled on np.vstack):

In [140]: sparse.vstack(alist)                                                                           
Out[140]: 
<2x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>

We get more interesting results if we select the rows that don't sum to zero:

In [141]: alist = [row for row in M if np.sum(row.toarray()[0])!=0]                                      
In [142]: M1=sparse.vstack(alist)                                                                        
In [143]: M1                                                                                             
Out[143]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
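
Applied to the loop from the question, the vstack approach might look like the sketch below. The small X_train and y_train here are made-up stand-ins for the real data, and X_rows is a hypothetical name for the intermediate list; the row test uses the sparse row's own sum() so that no dense copy of the row is made.

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the question's data; the values are invented for illustration.
X_train = sparse.csr_matrix(np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
]))
y_train = np.array([10, 11, 12, 13])

X_rows = []        # hypothetical name: list of 1 x n sparse rows
y_train_new = []
for i in range(X_train.shape[0]):
    row = X_train[i]            # a 1 x n sparse matrix
    if row.sum() == 0:          # sum computed on the sparse row, no toarray()
        X_rows.append(row)
        y_train_new.append(y_train[i])

# vstack joins the list of sparse rows into a single sparse matrix.
X_train_new = sparse.vstack(X_rows, format="csr")
y_train_new = np.array(y_train_new)
```

Note that sparse.vstack raises an error on an empty list, so a real script would probably want to check that at least one row was selected.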

But I showed before that we can get the row sums without iterating. Applying where to Out[128], I get the row indices (of the nonzero rows):

In [151]: idx=np.where(M.sum(axis=1))                                                                    
In [152]: idx                                                                                            
Out[152]: (array([0, 1, 2, 4, 5, 6, 8, 9]), array([0, 0, 0, 0, 0, 0, 0, 0]))
In [153]: M2=M[idx[0],:]                                                                                 
In [154]: M2                                                                                             
Out[154]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [155]: np.allclose(M1.A, M2.A)                                                                        
Out[155]: True
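
For the question's actual case (keep the rows whose sum is zero, along with the matching labels), the same indexing idea can be sketched without any Python-level loop. The data below is a made-up stand-in, and row_sums and zero_idx are illustrative names, not anything from the original post.

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the question's data; the values are invented for illustration.
X_train = sparse.csr_matrix(np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
]))
y_train = np.array([10, 11, 12, 13])

# sum(axis=1) returns a dense (n_rows, 1) result; flatten it to a 1-D array.
row_sums = np.asarray(X_train.sum(axis=1)).ravel()

# Integer positions of the rows whose sum is exactly zero.
zero_idx = np.flatnonzero(row_sums == 0)

# Index the sparse matrix and the label array with the same row positions.
X_train_new = X_train[zero_idx]
y_train_new = y_train[zero_idx]
```

Only the vector of row sums is dense here (one value per row), so this should stay well within memory even for a 100000 x 100000 matrix.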

====

I suspect the error in In[137] was produced while trying to find the nonzero (np.where) elements of the input, or of the input cast as a numpy array:

In [159]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [160]: np.array(alist)                                                                                
Out[160]: 
array([<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
       <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>], dtype=object)
In [161]: np.array(alist).nonzero()                                                                      
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-161-832a25987c15> in <module>
----> 1 np.array(alist).nonzero()

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

np.array on a list of sparse matrices produces an object dtype array of those matrices.
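
As a side note on the earlier remark about values cancelling out: a row sum of zero and an empty row are not the same test in general. The sketch below uses an invented matrix with a row whose entries cancel, and compares the row sums against the per-row count of stored entries from getnnz(axis=1). Keep in mind that getnnz counts stored entries, so explicitly stored zeros would be counted too.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([
    [1.0, -1.0, 0.0],   # nonzero values that cancel: the sum is 0
    [0.0,  0.0, 0.0],   # row with no stored values at all
    [0.0,  2.0, 0.0],
]))

row_sums = np.asarray(M.sum(axis=1)).ravel()
stored_per_row = M.getnnz(axis=1)     # number of stored entries in each row

sums_to_zero = np.flatnonzero(row_sums == 0)       # rows 0 and 1
empty_rows = np.flatnonzero(stored_per_row == 0)   # row 1 only
```

Which test is appropriate depends on whether "sum equals 0" or "row is empty" is the real selection criterion.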
