Why does word2Vec use cosine similarity?


Problem description

I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.

However, I do not understand why cosine is the correct measure of word similarity. Cosine similarity says that two vectors point in the same direction, but they could have different magnitudes.

For example, cosine similarity makes sense comparing bag-of-words for documents. Two documents might be of different length, but have similar distributions of words.

Why not, say, Euclidean distance?

Can anyone explain why cosine similarity works for word2Vec?

Answer

Cosine similarity of two n-dimensional vectors A and B is defined as:

    cos(A, B) = (A · B) / (||A|| ||B||)

which simply is the cosine of the angle between A and B.
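As a quick illustrative sketch of that point (the vectors below are toy values, not real word2vec embeddings): two vectors pointing in the same direction get cosine similarity 1 even when their magnitudes differ, while their Euclidean distance is large.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = (A . B) / (||A|| ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: b points the same way as a, but is twice as long
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # b = 2 * a

print(cosine_similarity(a, b))  # 1.0: identical direction
print(np.linalg.norm(a - b))    # sqrt(14) ~= 3.74: large Euclidean distance
```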

whereas the Euclidean distance is defined as

    d(A, B) = ||A - B|| = sqrt( sum_i (A_i - B_i)^2 )


Now think about the distance between two random elements of the vector space. The cosine distance is bounded, because cos always lies in [-1, 1].

However, the Euclidean distance can be any non-negative value.

When the dimension n gets bigger, the angle between two randomly chosen points gets closer and closer to 90° (so their cosine similarity approaches 0), whereas two random points in the unit cube of R^n have an expected Euclidean distance of roughly 0.41 · n^0.5 (source).
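A small Monte Carlo sketch of this concentration effect (a hypothetical check written for this answer, not from the original source; exact numbers vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000

# Pairs of random points drawn uniformly from the unit cube of R^n
x = rng.random((trials, n))
y = rng.random((trials, n))

# Euclidean distances concentrate around ~0.41 * sqrt(n)
# (the exact constant is sqrt(1/6) ~= 0.408)
dists = np.linalg.norm(x - y, axis=1)
print(dists.mean() / np.sqrt(n))  # roughly 0.41

# Angles between the centered points concentrate near 90 degrees
u, v = x - 0.5, y - 0.5
cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
angles = np.degrees(np.arccos(cos))
print(angles.mean())              # close to 90
```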

Cosine distance is better for vectors in a high-dimensional space because of the curse of dimensionality. (I'm not absolutely sure about it, though.)
