Cosine similarity and tf-idf

Question

I am confused by the following comment about tf-idf and cosine similarity.

I was reading up on both, and then on Wikipedia, under Cosine Similarity, I found this sentence: "In case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°."

Now I'm wondering: aren't they two different things?

Is tf-idf already inside the cosine similarity? If yes, then what the heck: I can only see dot products and Euclidean lengths.

I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?

Recommended answer

Tf-idf is a transformation you apply to texts to get real-valued vectors, one per text. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms. That yields the cosine of the angle between the vectors.
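To make the two steps concrete, here is a minimal sketch in plain Python. It is not taken from the original answer: the helper names `tfidf_vectors` and `cosine_similarity`, the toy documents, and the smoothed idf formula are all assumptions chosen for illustration, and other tf-idf variants would work just as well.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Step 1: turn tokenized documents into tf-idf vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times a smoothed idf (an assumed variant, always positive).
        vectors.append(
            [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
        )
    return vocab, vectors

def cosine_similarity(u, v):
    """Step 2: dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector; the angle is undefined
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock prices fell sharply".split(),
]
vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents sharing several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents sharing no terms
print(sim_01, sim_02)
```

Note that `cosine_similarity` knows nothing about tf-idf; it only sees two lists of numbers. That is the sense in which the two are separate things.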

If d2 and q are tf-idf vectors, then

$$\cos\theta = \frac{\mathbf{d_2} \cdot \mathbf{q}}{\lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert}$$

where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees because tf-idf vectors are non-negative: every component is at least zero, so the dot product can never be negative.
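The bound can be spelled out in a few lines. This is a sketch of the standard argument, not part of the original answer; it writes the components of the two vectors as d_{2,i} and q_i and assumes neither vector is all zeros.

```latex
% Every tf-idf component is non-negative, so each product in the sum is too:
\mathbf{d_2} \cdot \mathbf{q} = \sum_i d_{2,i}\, q_i \;\ge\; 0
\qquad \text{since } d_{2,i} \ge 0,\ q_i \ge 0 .

% The norms of nonzero vectors are strictly positive, so the quotient is non-negative:
\cos\theta = \frac{\mathbf{d_2} \cdot \mathbf{q}}
                  {\lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert} \;\ge\; 0 .

% The Cauchy--Schwarz inequality supplies the upper bound:
\mathbf{d_2} \cdot \mathbf{q} \;\le\; \lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert
\quad\Longrightarrow\quad
0 \le \cos\theta \le 1
\quad\Longrightarrow\quad
0^\circ \le \theta \le 90^\circ .
```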

There's no particularly deep connection between tf-idf and the cosine similarity/vector space model (VSM); tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
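As a rough illustration of swapping the transformation, the sketch below feeds two other weightings, raw term counts and binary term presence, into the same cosine computation. The function names and the toy document and query are hypothetical, picked only for this example.

```python
import math
from collections import Counter

def count_vector(tokens, vocab):
    """Weight each term by its raw count in the text."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def binary_vector(tokens, vocab):
    """Weight each term 1 if present in the text, 0 otherwise."""
    present = set(tokens)
    return [1 if t in present else 0 for t in vocab]

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

doc = "the cat sat on the mat".split()
query = "the cat".split()
vocab = sorted(set(doc) | set(query))

# The similarity function stays the same; only the weighting changes.
results = {}
for weigh in (count_vector, binary_vector):
    results[weigh.__name__] = cosine_similarity(weigh(doc, vocab), weigh(query, vocab))
    print(weigh.__name__, round(results[weigh.__name__], 3))
```

Both weightings are non-negative, so the similarities again fall between 0 and 1, but they differ from each other because each weighting stresses different terms.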

(Formula taken from Wikipedia, hence the d2.)
