Cosine similarity of word2vec more than 1
Question
I used Spark's word2vec algorithm to compute document vectors for a text.
I then used the findSynonyms
function of the model object to get synonyms of a few words.
I see something like this:
w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
| word| similarity|
+------------+------------------+
| physics| 1.714908638833209|
| fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
| psychology| 1.458865636374223|
+------------+------------------+
I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity should lie between -1 and +1 (or between 0 and 1 when negative angles cannot occur).
Why is it more than 1 here? What is going wrong?
You should normalize the word vectors that you got from word2vec;
otherwise, you will get unbounded dot-product or cosine-similarity values.
From Levy et al., 2015 (and, actually, most of the literature on word embeddings):
Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.
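The quoted equivalence is easy to check numerically: once two vectors are scaled to unit length, their dot product equals their cosine similarity. A minimal sketch with arbitrary random vectors (the dimension 100 is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Cosine similarity of the raw vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the unit-normalized vectors.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
dot_of_normalized = np.dot(a_hat, b_hat)

# The two quantities coincide, and both are bounded by [-1, 1].
assert np.isclose(cosine, dot_of_normalized)
```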
How do you normalize?
You can do something like the following.
import numpy as np

def normalize(word_vec):
    # Scale the vector to unit length; leave zero vectors unchanged
    # to avoid division by zero.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm
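For example, normalizing two vectors with this helper before taking their dot product keeps the result within [-1, 1] (the two toy vectors below are made up for illustration):

```python
import numpy as np

def normalize(word_vec):
    # Scale the vector to unit length; leave zero vectors unchanged.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm

v1 = np.array([3.0, 4.0])
v2 = np.array([1.0, 2.0])

# Dot product of unit vectors is a proper cosine similarity.
sim = np.dot(normalize(v1), normalize(v2))
assert -1.0 <= sim <= 1.0
```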
References
- Should I do normalization to word embeddings from word2vec if I want to do semantic tasks?
- Should I normalize word2vec's word vectors before using them?
Update: Why is the cosine similarity of word2vec greater than 1?
According to this answer, in the Spark implementation of word2vec, findSynonyms
doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector.
The ordering and relative values are consistent with the true cosine similarity, but the actual values are all scaled.
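If you want true cosine similarities rather than Spark's scaled values, you can rank candidates yourself by normalizing both sides before the dot product. A sketch in plain NumPy, assuming you have extracted the word vectors into a dict (the `vocab` layout and the toy vectors are hypothetical, not Spark's API):

```python
import numpy as np

def find_synonyms(query_vec, vocab_vecs, top_k=4):
    # Rank vocabulary words by true cosine similarity to the query.
    # Because both sides are unit-normalized, every score is in [-1, 1].
    q = query_vec / np.linalg.norm(query_vec)
    sims = {}
    for word, vec in vocab_vecs.items():
        n = np.linalg.norm(vec)
        sims[word] = float(np.dot(q, vec / n)) if n > 0 else 0.0
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy vocabulary for illustration only.
vocab = {
    "physics": np.array([1.0, 0.9]),
    "fiction": np.array([0.2, 1.0]),
    "music": np.array([-1.0, 0.1]),
}
results = find_synonyms(np.array([1.0, 1.0]), vocab, top_k=2)
```

Unlike the Spark output above, none of these scores can exceed 1, because the query vector's norm has been divided out.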