Cosine similarity yields 'nan' values
Question
I was calculating a cosine similarity matrix for sparse vectors, and the elements that should have been floats came out as 'nan'.
'visits' is a sparse matrix showing how many times each user has visited each website. The original matrix had shape 1 500 000 x 1500, and I converted it into a sparse matrix using coo_matrix().tocsc().
The task is to find out how similar the websites are, so I decided to calculate the cosine metric between each pair of sites.
Here is my code:
import numpy as np

cosine_distance_matrix = np.ndarray(shape=(visits.shape[1], visits.shape[1]))

def norm(x):
    return np.sqrt(x.T.dot(x))

for i in range(0, visits.shape[1]):
    for k in range(0, i + 1):
        normi_normk = norm(visits[:, i]) * norm(visits[:, k])
        cosine_distance_matrix[i, k] = visits[:, i].T.dot(visits[:, k]) / normi_normk
        cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)
And this is what I got! O_o
[[ 1. nan nan ..., nan nan nan]
[ nan 1. nan ..., nan nan nan]
[ nan nan 1. ..., nan nan nan]
...,
[ nan nan nan ..., 1. nan nan]
[ nan nan nan ..., nan 1. nan]
[ nan nan nan ..., nan nan 1.]]
The program ran for 3 hours... Why does it produce this garbage instead of floats?
Recommended answer
Try:
def norm(x):
    return np.sqrt((x.T*x).A)
I constructed a smaller sample visits matrix and calculated cosine_distance_matrix with your code. Mine had a diagonal of 1s and lots of nan off the diagonal. I chose one of the nan items and looked at the corresponding i,k calculation.
In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])
visits is a sparse matrix, so visits[:,i] is also a sparse matrix (with 1 column). Your norm function therefore returns a 1x1 sparse matrix.
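A minimal sketch of this point, assuming a small made-up visits matrix (the values and the names col, squared and norm_value are illustrative, not from the question). Slicing one column and taking its dot product with itself stays sparse; a plain number only appears after an explicit conversion.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 users (rows) x 2 sites (columns).
visits = sparse.coo_matrix(np.array([[3, 0],
                                     [4, 0],
                                     [0, 2]])).tocsc()

col = visits[:, 0]          # still a sparse matrix, shape (3, 1)
squared = col.T.dot(col)    # a sparse 1x1 matrix, not a number

print(sparse.issparse(col), col.shape)
print(sparse.issparse(squared), squared.shape)

# Converting to a dense array and indexing yields an ordinary scalar.
norm_value = float(np.sqrt(squared.toarray()[0, 0]))
print(norm_value)
```

Here .toarray() is the spelled-out form of the .A shorthand used in this answer.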
For this pair the dot product is 0, but it is still a 1x1 sparse matrix:
In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 0 stored elements in Compressed Sparse Column format>
Dividing one of these sparse matrices by the other yields nan:
In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])
But if I change normi_normk to a scalar or dense array, I get 0:
In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])
So we have to change this from a matrix/matrix division to something involving dense arrays or scalars. This can be done in various ways. Rewriting norm to handle sparse matrices correctly is one.
In addition, I suggest using:
(visits[:,i].T*visits[:, k]).A/normi_normk
so that both terms of the division are dense.
Another possibility is to use visits[:,i].A and visits[:,k].A, so that the inner loop calculations are done with dense arrays rather than these matrices.
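The dense-column variant could look roughly like the sketch below. The function name cosine_matrix_loop and the toy data are assumptions for illustration, and .toarray() stands in for .A. Leaving entries at 0 when a column is all zeros is also a choice made here, not something the question specifies.

```python
import numpy as np
from scipy import sparse

def cosine_matrix_loop(visits):
    """Pairwise cosine similarity of columns, using dense 1-D columns in the loop."""
    visits = sparse.csc_matrix(visits)
    n_sites = visits.shape[1]
    result = np.zeros((n_sites, n_sites))
    for i in range(n_sites):
        # Dense 1-D copy of column i (the long form of visits[:, i].A).
        col_i = visits[:, i].toarray().ravel()
        for k in range(i + 1):
            col_k = visits[:, k].toarray().ravel()
            denom = np.linalg.norm(col_i) * np.linalg.norm(col_k)
            # Assumption: an all-zero column has no defined similarity, so keep 0.
            if denom > 0:
                result[i, k] = result[k, i] = col_i.dot(col_k) / denom
    return result

# Hypothetical toy data: 2 users x 4 sites, the last site never visited.
toy = np.array([[1, 0, 1, 0],
                [0, 1, 1, 0]])
print(cosine_matrix_loop(toy))
```

Every quantity inside the loop is a dense array or a scalar, so the division never involves a sparse matrix. With 1 500 000 rows the repeated dense copies will still be slow.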
Note that I'm not doing anything advanced or special. I just examined one of the problem calculations in detail and found the source of the nan.
I would also suggest using np.zeros to initialize the array. I only use ndarray when the usual zeros, ones, and empty don't work.
cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))
In the big picture, it would be best to avoid looping over i and k and do everything with matrix products and the like. But this fix will get you going.
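One way the loop-free version might look, as a sketch under stated assumptions rather than a drop-in replacement. The function name cosine_similarity_columns and the toy data are hypothetical, and entries involving an all-zero column are left at 0 by choice. The idea is that visits.T @ visits holds every pairwise dot product at once, and its diagonal holds the squared column norms.

```python
import numpy as np
from scipy import sparse

def cosine_similarity_columns(visits):
    """Cosine similarity between all pairs of columns, returned as a dense array."""
    visits = sparse.csc_matrix(visits, dtype=np.float64)
    # Gram matrix: entry (i, k) is the dot product of columns i and k.
    gram = (visits.T @ visits).toarray()
    # Column norms are the square roots of the Gram diagonal.
    norms = np.sqrt(np.diag(gram))
    denom = np.outer(norms, norms)
    # Assumption: where a norm is zero the similarity is undefined; keep 0 there.
    sim = np.zeros_like(gram)
    np.divide(gram, denom, out=sim, where=denom > 0)
    return sim

# Hypothetical toy data: 2 users x 4 sites, the last site never visited.
toy = np.array([[1, 0, 1, 0],
                [0, 1, 1, 0]])
print(cosine_similarity_columns(toy))
```

The Gram matrix is only sites x sites (1500 x 1500 for the shape given in the question), so converting it to a dense array should be affordable even though visits itself has many rows.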