Cosine similarity of word2vec more than 1


Problem description

I used Spark's word2vec algorithm to compute document vectors for a text.

I then used the findSynonyms function of the model object to get synonyms for a few words.

I see something like this:

w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity should lie between 0 and 1, or at most between -1 and +1 if negative angles are allowed.

Why is it more than 1 here? What is going wrong?

Solution

You should normalize the word vectors that you got from word2vec, otherwise you would get unbounded dot product or cosine similarity values.

From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
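As a quick sketch of why that equivalence holds (using made-up toy vectors, not actual word2vec output): once two vectors are scaled to unit length, their plain dot product equals their cosine similarity.

```python
import numpy as np

# Two arbitrary toy vectors (hypothetical, for illustration only).
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity in the general case: dot product over the product of norms.
cos_ab = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After normalizing to unit length, the plain dot product gives the same value.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = a_unit.dot(b_unit)
```

This is exactly why the literature normalizes first: it lets a cheap dot product stand in for the full cosine computation.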

How to do normalization?

You can do something like the following.

import numpy as np

def normalize(word_vec):
    """Scale a word vector to unit length; leave zero vectors unchanged."""
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm
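A quick check of the effect (the normalize function is repeated here so the sketch is self-contained, and the random vectors are stand-ins for real word vectors): once both vectors are unit length, their dot product is a true cosine similarity and cannot exceed 1 in absolute value.

```python
import numpy as np

# Same normalize function as above, repeated for a self-contained example.
def normalize(word_vec):
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm

# Random vectors stand in for word vectors here.
rng = np.random.default_rng(0)
v1 = normalize(rng.normal(size=50))
v2 = normalize(rng.normal(size=50))

# With unit-length inputs, the dot product is a bounded cosine similarity.
sim = float(v1.dot(v2))
```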


Update: Why is the cosine similarity of word2vec greater than 1?

According to this answer, in the Spark implementation of word2vec, findSynonyms does not actually return cosine distances, but rather cosine distances multiplied by the norm of the query vector.

The ordering and relative values are consistent with the true cosine distance, but the actual values are all scaled.
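This scaling behavior can be sketched with toy numbers (the vectors below are invented for illustration, not real word2vec output): a Spark-style score of cosine times the query norm can exceed 1, but dividing the reported score by the query vector's norm recovers the true cosine.

```python
import numpy as np

# Hypothetical unnormalized word vectors (names are assumptions for illustration).
query = np.array([2.0, 1.0, 2.0])      # e.g. the vector for 'science'
candidate = np.array([1.0, 0.0, 1.0])  # e.g. the vector for 'physics'

# True cosine similarity, always bounded by 1 in absolute value.
true_cos = query.dot(candidate) / (np.linalg.norm(query) * np.linalg.norm(candidate))

# Spark-style score: cosine similarity times the norm of the query vector,
# which can exceed 1 even though the true cosine cannot.
spark_score = true_cos * np.linalg.norm(query)

# Dividing the reported score by the query norm recovers the true cosine.
recovered = spark_score / np.linalg.norm(query)
```

Because every candidate's score is scaled by the same query norm, the ranking of synonyms is unaffected; only the absolute values are inflated.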
