Cosine similarity yields 'nan' values


Question

I was calculating a cosine similarity matrix for sparse vectors, and the elements that should have been floats came out as 'nan'.

'visits' is a sparse matrix showing how many times each user has visited each website. The matrix has shape 1 500 000 x 1500, and I converted it to a sparse matrix using coo_matrix().tocsc().

The task is to find out how similar the websites are, so I decided to calculate the cosine metric between each pair of sites.

Here is my code:

cosine_distance_matrix = np.ndarray(shape = (visits.shape[1], visits.shape[1]))

def norm(x):
    return np.sqrt(
        x.T.dot(x)
    )

for i in range(0, visits.shape[1]):
  for k in range(0, i + 1):
    normi_normk = norm(visits[:,i]) * norm(visits[:,k])
    cosine_distance_matrix[i,k] = visits[:,i].T.dot(visits[:, k])/normi_normk
    cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)

And this is what I got! O_o

[[  1.  nan  nan ...,  nan  nan  nan]
 [ nan   1.  nan ...,  nan  nan  nan]
 [ nan  nan   1. ...,  nan  nan  nan]
 ..., 
 [ nan  nan  nan ...,   1.  nan  nan]
 [ nan  nan  nan ...,  nan   1.  nan]
 [ nan  nan  nan ...,  nan  nan   1.]]

This program ran for 3 hours... Why do I get this garbage instead of float numbers?

Recommended answer

Try:

def norm(x):
    return np.sqrt((x.T*x).A)

I constructed a smaller sample visits matrix and calculated cosine_distance_matrix with your code. Mine had a diagonal of 1s and lots of nan off the diagonal. I chose one of the nan items and looked at the corresponding i,k calculation.

In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]: 
<1x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])

visits is a sparse matrix, so visits[:,i] is also a sparse matrix (1 column). Your norm function therefore returns a 1x1 sparse matrix rather than a scalar.

For this pair the dot product is 0, but it is still a 1x1 sparse matrix:

In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in Compressed Sparse Column format>

Dividing one of these sparse matrices by the other gives nan:

In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])

But if I change normi_normk to a scalar or dense array, I get 0:

In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])

So we have to change this from a matrix/matrix division to something involving dense arrays or scalars. That can be done in various ways. Rewriting norm to handle sparse matrices correctly is one of them.
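As an illustration of that idea, here is a minimal sketch of a norm that returns a plain float for a sparse column, together with a dot product pulled out of its 1x1 sparse wrapper. The helper name cosine, the tiny 3x3 visits matrix, and the choice to return nan for an all-zero column are assumptions made for this example, not part of the original question.

```python
import numpy as np
from scipy import sparse

def norm(x):
    """Euclidean norm of a sparse column, returned as a plain float."""
    # multiply() squares element-wise and stays sparse; sum() yields a scalar.
    return float(np.sqrt(x.multiply(x).sum()))

def cosine(a, b):
    """Cosine similarity of two sparse columns (hypothetical helper)."""
    # a.T.dot(b) is a 1x1 sparse matrix; toarray()[0, 0] extracts the scalar.
    dot = a.T.dot(b).toarray()[0, 0]
    denom = norm(a) * norm(b)
    # Assumption: an all-zero column has no defined direction, so return nan.
    return dot / denom if denom else float("nan")

# Small stand-in for the real visits matrix (rows: users, columns: sites).
visits = sparse.csc_matrix(np.array([[1, 0, 2],
                                     [0, 3, 0],
                                     [4, 0, 8]]))

sim_02 = cosine(visits[:, 0], visits[:, 2])  # parallel columns
sim_01 = cosine(visits[:, 0], visits[:, 1])  # columns with no shared users
```

Because both the numerator and the denominator are scalars here, the division never goes through sparse matrix division at all.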

In addition, I'd suggest using:

(visits[:,i].T*visits[:, k]).A/normi_normk

so that both terms of the division are dense.

Another possibility is to use visits[:,i].A and visits[:,k].A, so the inner loop calculations are done with dense arrays rather than these sparse matrices.
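A rough sketch of that dense-column variant follows, keeping the original triangular loop. It uses toarray() (the method form of .A) and flattens each column to 1-D so that every dot product is a scalar. The function name and the small test matrix are made up for illustration, and the loop is still slow on a matrix the size of the real one.

```python
import numpy as np
from scipy import sparse

def cosine_matrix_dense_columns(visits):
    """Fill the similarity matrix using dense 1-D columns (illustrative name)."""
    n = visits.shape[1]
    out = np.zeros((n, n))
    for i in range(n):
        # Densify one column at a time to keep memory use bounded.
        col_i = visits[:, i].toarray().ravel().astype(float)
        for k in range(i + 1):
            col_k = visits[:, k].toarray().ravel().astype(float)
            denom = np.sqrt(col_i.dot(col_i)) * np.sqrt(col_k.dot(col_k))
            # Assumption: report nan when either column is all zeros.
            value = col_i.dot(col_k) / denom if denom else np.nan
            out[i, k] = out[k, i] = value
    return out

small_visits = sparse.csc_matrix(np.array([[1, 0, 2],
                                           [0, 3, 0],
                                           [4, 0, 8]]))
result = cosine_matrix_dense_columns(small_visits)
```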

Note that I'm not doing anything advanced or special. I just examined one of the problem calculations in detail and found the source of the nan.

I would also suggest using np.zeros to initialize the array. I only use np.ndarray directly when the usual zeros, ones, and empty don't work.

cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))


In the big picture it would be best to avoid looping over i and k and to do everything with matrix products instead. But this fix will get you going.
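For reference, one way the loop-free version might look is sketched below: a single sparse product gives all pairwise dot products, and an outer product of the column norms gives all denominators. The function name, the small test matrix, and the decision to leave nan where a site has no visits are assumptions for this sketch. It also assumes the n_sites x n_sites result fits in memory as a dense array, which holds for 1500 sites.

```python
import numpy as np
from scipy import sparse

def cosine_similarity_columns(visits):
    """Pairwise cosine similarity between columns (illustrative name)."""
    visits = sparse.csc_matrix(visits, dtype=np.float64)
    # Column norms: square element-wise, sum down each column, take the root.
    norms = np.sqrt(np.asarray(visits.multiply(visits).sum(axis=0)).ravel())
    # All pairwise dot products at once; the result is n_sites x n_sites.
    gram = (visits.T @ visits).toarray()
    denom = np.outer(norms, norms)
    # Start from nan and divide only where the denominator is non-zero.
    sim = np.full(gram.shape, np.nan)
    np.divide(gram, denom, out=sim, where=denom > 0)
    return sim

# The last column is all zeros to show the nan handling.
demo_visits = sparse.csc_matrix(np.array([[1, 0, 2, 0],
                                          [0, 3, 0, 0],
                                          [4, 0, 8, 0]]))
sim = cosine_similarity_columns(demo_visits)
```

Here nan appears only for the site with no visits at all, where the cosine is genuinely undefined, rather than for every off-diagonal pair.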
