Python sparse matrix remove duplicate indices except one?


Question

I am computing the cosine similarity between the rows of a matrix of vectors, and I get the result as a sparse matrix like this:

  (0, 26) 0.359171459261
  (0, 25) 0.121145761751
  (0, 24) 0.316922015914
  (0, 23) 0.157622038039
  (0, 22) 0.636466644041
  (0, 21) 0.136216495731
  (0, 20) 0.243164535496
  (0, 19) 0.348272617805
  (0, 18) 0.636466644041
  (0, 17) 1.0

But there are duplicates, for example:

(0, 24) 0.316922015914 and (24, 0) 0.316922015914

What I want to do is remove these duplicates by index, for all vectors in the matrix: if I have (0, 24) then I don't need (24, 0), because it is the same value, so I want to keep only one of the pair and remove the other. Currently I have the following code to create the matrix:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict))
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)

So to summarize: I don't want to remove all duplicates. I want to keep exactly one entry of each duplicate pair, in a Pythonic way.

Thanks in advance!

Recommended answer

I think it is easiest to take the upper triangle of the matrix in coo format.

First make a small symmetric matrix:

In [876]: A = sparse.random(5,5,.3,'csr')
In [877]: A = A+A.T
In [878]: A
Out[878]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>
In [879]: A.toarray()
Out[879]: 
array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
       [ 0.81388978,  0.73944395,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.20736975,  0.        ,  0.05581152,  0.04448881],
       [ 0.        ,  0.98968617,  0.        ,  0.04448881,  0.        ]])

Convert to coo, and set the lower-triangle data values (those with row > col) to 0:

In [880]: Ao = A.tocoo()
In [881]: mask = (Ao.row>Ao.col)
In [882]: mask
Out[882]: 
array([False, False, False, False,  True,  True,  True, False, False,
        True,  True], dtype=bool)
In [883]: Ao.data[mask]=0

Convert back to csr, and use eliminate_zeros to prune the matrix. Note that the stored-element count only drops from 11 to 6 after eliminate_zeros is called:

In [890]: A1 = Ao.tocsr()
In [891]: A1
Out[891]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in Compressed Sparse Row format>
In [892]: A1.eliminate_zeros()
In [893]: A1
Out[893]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [894]: A1.toarray()
Out[894]: 
array([[ 0.        ,  0.        ,  0.81388978,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.73944395,  0.20736975,  0.98968617],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.05581152,  0.04448881],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
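
The steps above can be collected into one helper. This is a minimal sketch, assuming the input is a symmetric SciPy sparse matrix; the name keep_upper_triangle is made up for illustration and is not part of SciPy. The example matrix uses fixed values so the result is easy to check by eye.

```python
import numpy as np
from scipy import sparse


def keep_upper_triangle(A):
    """Return a CSR copy of A with all entries below the diagonal removed.

    Hypothetical helper name; it follows the coo -> mask -> eliminate_zeros
    steps from the session above.
    """
    Ao = A.tocoo(copy=True)   # copy so the caller's matrix is left untouched
    mask = Ao.row > Ao.col    # True for entries below the diagonal
    Ao.data[mask] = 0         # zero them out ...
    A1 = Ao.tocsr()
    A1.eliminate_zeros()      # ... then drop the stored zeros in place
    return A1


# Small symmetric example with known values
dense = np.array([[0.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 0.0]])
A = sparse.csr_matrix(dense)
A1 = keep_upper_triangle(A)
print(A1.toarray())
```

SciPy also provides sparse.triu(A), which extracts the upper triangle directly and should give the same entries; sparse.triu(A, k=1) would additionally drop the diagonal, which may be preferable for a cosine similarity matrix where every diagonal entry is a self-similarity.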

Both the coo and csr formats have an in-place eliminate_zeros method. The coo version looks like this:

def eliminate_zeros(self):
    """Remove zero entries from the matrix

    This is an *in place* operation
    """
    mask = self.data != 0
    self.data = self.data[mask]
    self.row = self.row[mask]
    self.col = self.col[mask]

Instead of using Ao.data[mask]=0, you could use this code as a model for eliminating just the lower-triangle values.
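
One way to read that suggestion is sketched below: select on index position instead of on data != 0. This is an illustration under assumptions, not the answer's own code. The helper name drop_lower_triangle is hypothetical, and it builds a new coo matrix from the filtered arrays rather than assigning to the row and col attributes in place.

```python
import numpy as np
from scipy import sparse


def drop_lower_triangle(A):
    """Build a new coo matrix from only the entries with row <= col.

    Hypothetical helper; it mirrors the masking logic of eliminate_zeros,
    but the mask tests index position rather than data != 0.
    """
    Ao = A.tocoo()
    keep = Ao.row <= Ao.col   # entries on or above the diagonal
    return sparse.coo_matrix(
        (Ao.data[keep], (Ao.row[keep], Ao.col[keep])), shape=Ao.shape
    )


# Symmetric similarity-like matrix with ones on the diagonal
dense = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.7],
                  [0.0, 0.7, 1.0]])
U = drop_lower_triangle(sparse.csr_matrix(dense))
pairs = list(zip(U.row.tolist(), U.col.tolist()))
print(pairs)
```

Unlike the zero-and-prune approach, this never writes zeros into the data array, so it needs no eliminate_zeros call afterwards. Changing the comparison to Ao.row < Ao.col would also discard the diagonal.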
