Cosine similarity between 0 and 1


Question

I am interested in calculating similarity between vectors, however this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From Wikipedia (https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure):

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

The peculiarity is that I wish to calculate the similarity between two vectors from two different word2vec models. These models have been aligned, though, so they should in fact represent their words in the same vector space. I can calculate the similarity between a word in model_a and a word in model_b like so:

import gensim as gs
from sklearn.metrics.pairwise import cosine_similarity

# Load the two aligned word2vec models (paths defined elsewhere)
model_a = gs.models.KeyedVectors.load_word2vec_format(model_a_path, binary=False)
model_b = gs.models.KeyedVectors.load_word2vec_format(model_b_path, binary=False)

# sklearn expects 2D arrays, so reshape each vector into a single-row matrix
vector_a = model_a[word_a].reshape(1, -1)
vector_b = model_b[word_b].reshape(1, -1)

# cosine_similarity returns a 1x1 matrix; .item(0) extracts the scalar
sim = cosine_similarity(vector_a, vector_b).item(0)

But sim is then a similarity metric in the [-1,1] range. Is there a scientifically sound way to map this to the [0,1] range? Intuitively I would think that something like

norm_sim = (sim + 1) / 2

is okay, but I'm not sure whether that is good practice with respect to the actual meaning of cosine similarity. If not, are other similarity metrics advised?

The reason I am trying to get the values between 0 and 1 is that the data will be transferred to a colleague who will use it as a feature for her machine learning system, which expects all values to be between 0 and 1. Her intuition was to take the absolute value, but that seems to me to be a worse alternative, because then you map opposites to be identical. Considering the actual meaning of cosine similarity, though, I might be wrong. So if taking the absolute value is a good approach, we can do that as well.

Answer

You have a fair reason to prefer 0.0-1.0 (though many learning algorithms should do just fine with a -1.0 to 1.0 range). Your norm_sim rescaling of -1.0 to 1.0 to 0.0 to 1.0 is fine, if your only purpose is to get 0.0-1.0 ranges... but of course the resulting value isn't a true cosine-similarity anymore.
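
For instance, a minimal helper that applies this rescaling (the function name is illustrative, not from the original answer):

def to_unit_range(sim):
    # Map a cosine similarity in [-1.0, 1.0] linearly onto [0.0, 1.0].
    # Rank order is preserved: -1.0 -> 0.0, 0.0 -> 0.5, 1.0 -> 1.0.
    return (sim + 1.0) / 2.0

norm_sim = to_unit_range(sim)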

It won't necessarily matter that the values aren't real full-range angles any more. (If the algorithm needed real angles, it'd work with -1.0 to 1.0.)

Using the signless absolute value would be a bad idea, as it would change the rank order of similarities – moving some results that are "natively" most-dissimilar way up.
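
A small worked example with hypothetical similarity values makes the difference clear:

sims = {"synonym": 0.9, "unrelated": 0.1, "antonym": -0.8}

# Linear rescaling keeps the ranking intact: 0.95 > 0.55 > 0.10
rescaled = {w: (s + 1) / 2 for w, s in sims.items()}

# abs() reorders it: the antonym (0.8) now looks more similar than the unrelated word (0.1)
absolute = {w: abs(s) for w, s in sims.items()}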

There's been work on constraining word-vectors to have only non-negative values in their dimensions, and the usual benefit is that the resulting dimensions are more likely to be individually interpretable. (See for example https://cs.cmu.edu/~bmurphy/NNSE/.) However, gensim doesn't support this variant, and only trying it could reveal whether it would be better for any particular project.

Also, there's other research that suggests usual word-vectors may not be 'balanced' around the origin (so you'll see fewer negative cosine-similarities than would be expected from points in a random hypersphere), and that shifting them to be more balanced will usually improve them for other tasks. See: https://arxiv.org/abs/1702.01417v2
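
As a rough sketch of the centering idea from that paper (an assumption-laden simplification: the full method also removes the top principal components, which is omitted here):

# Assumes a gensim KeyedVectors; .vectors is the raw (n_words, dim) numpy matrix
vectors = model_a.vectors
# Shift the point cloud so it is balanced around the origin
centered = vectors - vectors.mean(axis=0)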
