scipy.sparse.coo_matrix: how to quickly find all-zero columns, fill them with 1s, and normalize

Question

For a matrix, I want to find the columns that are all zeros, fill them with 1s, and then normalize the matrix by column. I know how to do that with np.arrays:

[[0 0 0 0 0]
 [0 0 1 0 0]
 [1 0 0 1 0]
 [0 0 0 0 1]
 [1 0 0 0 0]]      
     |
     V
[[0 1 0 0 0]
 [0 1 1 0 0]
 [1 1 0 1 0]    
 [0 1 0 0 1]
 [1 1 0 0 0]]
     |
     V
[[0   0.2 0 0 0]
 [0   0.2 1 0 0]
 [0.5 0.2 0 1 0]   
 [0   0.2 0 0 1]
 [0.5 0.2 0 0 0]]
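
For reference, the dense version of these three steps might look like the sketch below. The variable names (zero_cols, filled, normalized) are illustrative and do not come from the original post.

```python
import numpy as np

A = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]], dtype=float)

# Boolean mask of the columns that contain no nonzero entry
zero_cols = ~A.any(axis=0)

# Fill those columns with 1s, then divide every column by its sum
filled = A.copy()
filled[:, zero_cols] = 1.0
normalized = filled / filled.sum(axis=0)
```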

But how can I do the same thing when the matrix is in scipy.sparse.coo.coo_matrix form, without converting it back to np.arrays?

Recommended answer

This will be a lot easier with the lil format, working with rows rather than columns:

In [1]: import numpy as np
   ...: from scipy import sparse
In [2]: A=np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
In [3]: A
Out[3]: 
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0]])
In [4]: At=A.T                # switch to work with rows

In [5]: M=sparse.lil_matrix(At)

Now it is obvious which row is all zeros: its entries in data and rows are empty lists.

In [6]: M.data
Out[6]: array([[1, 1], [], [1], [1], [1]], dtype=object)
In [7]: M.rows
Out[7]: array([[2, 4], [], [1], [2], [3]], dtype=object)

And the lil format allows us to fill that row:

In [8]: M.data[1]=[1,1,1,1,1]
In [9]: M.rows[1]=[0,1,2,3,4]
In [10]: M.toarray()
Out[10]: 
array([[0, 0, 1, 0, 1],
       [1, 1, 1, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=int32)

I could also use M[1,:]=np.ones(5,int)
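
A minimal sketch that combines both steps for any number of empty rows, assuming the matrix is already in lil format and has been transposed as above. The name empty_rows is illustrative.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]])
M = sparse.lil_matrix(A.T)  # rows of M are the columns of A

# A row with no stored entries has an empty index list in M.rows
empty_rows = [i for i, cols in enumerate(M.rows) if len(cols) == 0]

# Fill each empty row with 1s through ordinary slice assignment
for i in empty_rows:
    M[i, :] = np.ones(M.shape[1], dtype=M.dtype)
```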

The coo format is great for creating the matrix from the data/row/col arrays, but it doesn't implement indexing or math. It has to be converted to csr for that, or to csc for column-oriented operations.
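
As a small illustration of that workflow, the sketch below builds the example matrix from data/row/col arrays and converts it before slicing. The names Mr, Mcol, first_col and third_row are made up for this example.

```python
import numpy as np
from scipy import sparse

# Coordinates of the five nonzero entries of the example matrix A
row = np.array([1, 2, 2, 3, 4])
col = np.array([2, 0, 3, 4, 0])
data = np.ones(5, dtype=int)
Mo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

Mr = Mo.tocsr()    # row-oriented: suited to row slicing and arithmetic
Mcol = Mo.tocsc()  # column-oriented: suited to column slicing

first_col = Mcol[:, 0].toarray().ravel()
third_row = Mr[2, :].toarray().ravel()
```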

The row that I filled isn't so obvious in the csr format:

In [14]: Mc=M.tocsr()
In [15]: Mc.data
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [16]: Mc.indices
Out[16]: array([2, 4, 0, 1, 2, 3, 4, 1, 2, 3], dtype=int32)
In [17]: Mc.indptr
Out[17]: array([ 0,  2,  7,  8,  9, 10], dtype=int32)

On the other hand, normalizing is probably easier in this format.

In [18]: Mc.sum(axis=1)
Out[18]: 
matrix([[2],
        [5],
        [1],
        [1],
        [1]], dtype=int32)
In [19]: Mc/Mc.sum(axis=1)
Out[19]: 
matrix([[ 0. ,  0. ,  0.5,  0. ,  0.5],
        [ 0.2,  0.2,  0.2,  0.2,  0.2],
        [ 0. ,  1. ,  0. ,  0. ,  0. ],
        [ 0. ,  0. ,  1. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ,  1. ,  0. ]])

Notice that this converted the sparse matrix to a dense one. The sum is dense, and arithmetic that mixes sparse and dense operands usually produces a dense result.

I have to use a more roundabout calculation to preserve the sparse status:

In [27]: Mc.multiply(sparse.csr_matrix(1/Mc.sum(axis=1)))
Out[27]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
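
Another way to keep the result sparse, not used in the original answer, is to multiply by a diagonal matrix of reciprocal row sums. The helper name normalize_rows and the choice to leave all-zero rows untouched are assumptions made for this sketch.

```python
import numpy as np
from scipy import sparse

def normalize_rows(M):
    """Scale each row so it sums to 1; the result stays sparse."""
    M = sparse.csr_matrix(M, dtype=float)
    row_sums = np.asarray(M.sum(axis=1)).ravel()
    # Leave all-zero rows untouched instead of dividing by zero
    scale = np.ones_like(row_sums)
    nonzero = row_sums != 0
    scale[nonzero] = 1.0 / row_sums[nonzero]
    # Left-multiplying by a diagonal matrix scales the rows
    return sparse.diags(scale) @ M

At_filled = np.array([[0, 0, 1, 0, 1],
                      [1, 1, 1, 1, 1],
                      [0, 1, 0, 0, 0],
                      [0, 0, 1, 0, 0],
                      [0, 0, 0, 1, 0]])
Mn = normalize_rows(sparse.csr_matrix(At_filled))
```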

Here's a way of doing this with the csc format (on A):

In [40]: Ms=sparse.csc_matrix(A)
In [41]: Ms.sum(axis=0)
Out[41]: matrix([[2, 0, 1, 1, 1]], dtype=int32)

Use sum to find the all-zero column. This can give a wrong answer if a column has negative values that happen to sum to 0. If that's a concern, make a copy of the matrix with all data values replaced by 1 and sum that instead.
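
The copy-and-replace idea could be sketched as follows. The helper name empty_columns is hypothetical, and the call to eliminate_zeros() is an added assumption so that explicitly stored zeros are not counted as entries.

```python
import numpy as np
from scipy import sparse

def empty_columns(M):
    """Return indices of columns that hold no nonzero value."""
    Mc = sparse.csc_matrix(M).copy()  # copy so the caller's data is untouched
    Mc.eliminate_zeros()              # drop explicitly stored zeros
    Mc.data[:] = 1                    # every stored entry now counts as 1
    counts = np.asarray(Mc.sum(axis=0)).ravel()
    return np.flatnonzero(counts == 0)

# Column 0 sums to zero (1 + -1) although it is not empty
B = sparse.csc_matrix(np.array([[1, 0, 0],
                                [-1, 0, 2]]))
by_sum = np.flatnonzero(np.asarray(B.sum(axis=0)).ravel() == 0)
by_count = empty_columns(B)
```

Here by_sum wrongly reports column 0 as well, while by_count reports only the truly empty column. For a csc matrix the same per-column counts should also be available as np.diff(Mc.indptr), which avoids overwriting the data.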

In [43]: Ms[:,1]=np.ones(5,int)[:,None]
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
In [44]: Ms.toarray()
Out[44]: 
array([[0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 0]])

The warning matters more if you make this sort of change repeatedly. Notice that I had to reshape the array being assigned into a column vector with [:,None] so that it matches the Ms[:,1] slice. Depending on the number of all-zero columns, this action can change the sparsity of the matrix substantially.

=================

I could search the col attribute of the coo format for missing column indices with:

In [69]: Mo=sparse.coo_matrix(A)
In [70]: Mo.col
Out[70]: array([2, 0, 3, 4, 0], dtype=int32)

In [71]: Mo.col==np.arange(Mo.shape[1])[:,None]
Out[71]: 
array([[False,  True, False, False,  True],
       [False, False, False, False, False],
       [ True, False, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)

In [72]: idx = np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
In [73]: idx
Out[73]: array([1], dtype=int32)

I could then add a column of 1s at this idx with:

In [75]: N=Mo.shape[0]
In [76]: data = np.concatenate([Mo.data, np.ones(N,int)])
In [77]: row = np.concatenate([Mo.row, np.arange(N)])
In [78]: col = np.concatenate([Mo.col, np.ones(N,int)*idx])
In [79]: Mo1 = sparse.coo_matrix((data,(row, col)), shape=Mo.shape)
In [80]: Mo1.toarray()
Out[80]: 
array([[0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 0]])

As written it works for just one column, but it could be generalized to several. I also created a new matrix rather than updating Mo. But this in-place assignment seems to work as well:

Mo.data,Mo.col,Mo.row = data,col,row
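
The generalization to several columns mentioned above might look like the sketch below. The function name fill_columns_with_ones is hypothetical, and the code assumes every column in cols is currently empty; otherwise the added 1s would be summed with the existing values when the matrix is converted.

```python
import numpy as np
from scipy import sparse

def fill_columns_with_ones(Mo, cols):
    """Return a new coo_matrix with 1s stored in every column of `cols`."""
    Mo = sparse.coo_matrix(Mo)
    cols = np.asarray(cols, dtype=Mo.col.dtype)
    n_rows = Mo.shape[0]
    # One new entry per (row, column) pair being filled
    new_data = np.ones(n_rows * cols.size, dtype=Mo.data.dtype)
    new_row = np.tile(np.arange(n_rows, dtype=Mo.row.dtype), cols.size)
    new_col = np.repeat(cols, n_rows)
    data = np.concatenate([Mo.data, new_data])
    row = np.concatenate([Mo.row, new_row])
    col = np.concatenate([Mo.col, new_col])
    return sparse.coo_matrix((data, (row, col)), shape=Mo.shape)

# Columns 0 and 1 of this small example are empty
small = sparse.coo_matrix(np.array([[0, 0, 1],
                                    [0, 0, 0]]))
filled = fill_columns_with_ones(small, [0, 1])
```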

The normalization still requires csr conversion, though I think scipy.sparse handles that conversion for you behind the scenes.

In [87]: Mo1/Mo1.sum(axis=0)
Out[87]: 
matrix([[ 0. ,  0.2,  0. ,  0. ,  0. ],
        [ 0. ,  0.2,  1. ,  0. ,  0. ],
        [ 0.5,  0.2,  0. ,  1. ,  0. ],
        [ 0. ,  0.2,  0. ,  0. ,  1. ],
        [ 0.5,  0.2,  0. ,  0. ,  0. ]])

Even when I do the extra work of maintaining the sparse nature, I still get a csr matrix back:

In [89]: Mo1.multiply(sparse.coo_matrix(1/Mo1.sum(axis=0)))
Out[89]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
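
If the result has to stay in coo format, one option that the original answer does not use is to scale the data array directly by the reciprocal of each entry's column sum. This is only a sketch: the name normalize_columns_coo is hypothetical, and it assumes the all-zero columns have already been filled so that no column sum is zero.

```python
import numpy as np
from scipy import sparse

def normalize_columns_coo(M):
    """Column-normalize a sparse matrix and return the result as coo."""
    M = sparse.coo_matrix(M)
    col_sums = np.asarray(M.sum(axis=0)).ravel().astype(float)
    if np.any(col_sums == 0):
        raise ValueError("fill the all-zero columns before normalizing")
    # Each stored value is divided by the sum of the column it belongs to
    data = M.data / col_sums[M.col]
    return sparse.coo_matrix((data, (M.row, M.col)), shape=M.shape)

A_filled = np.array([[0, 1, 0, 0, 0],
                     [0, 1, 1, 0, 0],
                     [1, 1, 0, 1, 0],
                     [0, 1, 0, 0, 1],
                     [1, 1, 0, 0, 0]])
out = normalize_columns_coo(sparse.coo_matrix(A_filled))
```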

See "Find all-zero columns in pandas sparse matrix" for more methods of finding the all-zero columns. It turns out that Mo.col==np.arange(Mo.shape[1])[:,None] is too slow with a large Mo, because it builds a boolean array with one row per column and one entry per stored value. A test using np.in1d is much faster:

1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
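
A sketch of the same membership test that returns the column indices directly. It uses np.isin, which as far as I know is the recommended replacement for np.in1d in recent NumPy releases; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]])
Mo = sparse.coo_matrix(A)

# A column is empty when its index never appears in Mo.col
all_cols = np.arange(Mo.shape[1])
is_empty = ~np.isin(all_cols, Mo.col)
idx = np.flatnonzero(is_empty)
```

Note that Mo.col lists every stored entry, so a column holding only explicitly stored zeros would not be reported as empty by this test.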
