Lucene scoring, precision about vector space model


Question

I'm not sure I understand how the vector space model is used in Lucene scoring.

I read here (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) that Lucene scores a document as the sum of the tf-idf of each query term (if we omit the coordination factor, field length and boosts). I don't understand where the vector space model comes in.

The vector space model could be used to calculate the similarity between the tf-idf vector of a document and the tf-idf vector of the query. This would give us a cosine similarity score between the query and a document. Since tf-idf weights are non-negative, the score would be between 0 and 1, so scores from different requests should be easy to compare.

Why doesn't Lucene compute its score this way?

Recommended answer

Lucene uses the "practical scoring function" mentioned in your link, which approximates cosine similarity and is extended to support practical features such as boosts.

If you take the vector space cosine similarity formula for a query q and a document d, you have:

s(q, d) = q * d / (||q|| * ||d||)
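
As a minimal sketch of the formula above, assuming q and d are given as plain lists of weights over the same vocabulary (the function name `cosine_similarity` is only illustrative and is not part of Lucene):

```python
import math

def cosine_similarity(q, d):
    """Return q . d / (||q|| * ||d||) for two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    # A zero vector has no direction, so treat the similarity as 0.
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)
```

With non-negative weights such as tf-idf, the result always falls between 0 and 1.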

Considering that q and d are vectors like [tf(t1) * idf(t1), ...], and that in the q vector tf(t) is either 1 or 0, each query term contributes idf(t) from q and tf(t in d) * idf(t) from d to the dot product, so the formula becomes:

s(q, d) = ∑( tf(t in d) * idf(t)² )(t in q) / (||q|| * ||d||)

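Using the definition queryNorm = 1/√sumOfSquaredWeights, where sumOfSquaredWeights is the sum of idf(t)² over the query terms (that is, ||q||²), you can rewrite this as: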

s(q, d) = queryNorm(q) * ∑( tf(t in d) * idf(t)² )(t in q) / ||d||
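
The rewrite can be checked numerically. The sketch below uses made-up idf values and term frequencies (all names and numbers are hypothetical, chosen only for illustration) and compares the plain cosine similarity with the queryNorm form:

```python
import math

# Hypothetical idf values and term frequencies, for illustration only.
idf = {"lucene": 2.0, "score": 1.5, "vector": 1.0}
tf_in_d = {"lucene": 3, "score": 1, "vector": 0}
query_terms = ["lucene", "score"]

vocab = sorted(idf)
# In the query vector tf(t) is 1 for query terms and 0 otherwise.
q_vec = [idf[t] if t in query_terms else 0.0 for t in vocab]
d_vec = [tf_in_d[t] * idf[t] for t in vocab]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Plain cosine similarity: q . d / (||q|| * ||d||)
cosine = sum(a * b for a, b in zip(q_vec, d_vec)) / (norm(q_vec) * norm(d_vec))

# Rewritten form: queryNorm(q) * sum(tf * idf^2) / ||d||
sum_of_squared_weights = sum(idf[t] ** 2 for t in query_terms)
query_norm = 1 / math.sqrt(sum_of_squared_weights)
rewritten = query_norm * sum(tf_in_d[t] * idf[t] ** 2 for t in query_terms) / norm(d_vec)

print(math.isclose(cosine, rewritten))
```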

The rewritten formula is close to the one given in the docs:

score(q, d) = queryNorm(q) * coord(q,d) * 
              ∑ ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d)) (t in q)  
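
To make the structure of this formula concrete, here is a rough sketch that takes tf, idf, boosts and the field norm as ready-made inputs. It is a simplification under stated assumptions, not Lucene's implementation: `practical_score` and its parameters are hypothetical names, coord is taken as the fraction of query terms found in the document, a single norm value stands in for norm(t,d), and boosts are left out of queryNorm.

```python
import math

def practical_score(query_terms, tf_in_d, idf, boost, field_norm):
    """Simplified reading of the practical scoring function (not Lucene code)."""
    matching = [t for t in query_terms if tf_in_d.get(t, 0) > 0]
    if not matching:
        return 0.0
    # queryNorm = 1 / sqrt(sumOfSquaredWeights); boosts ignored here (assumption).
    query_norm = 1 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))
    # coord: share of query terms that appear in the document (assumption).
    coord = len(matching) / len(query_terms)
    total = sum(
        tf_in_d[t] * idf[t] ** 2 * boost.get(t, 1.0) * field_norm
        for t in matching
    )
    return query_norm * coord * total
```

When every query term matches, all boosts are 1 and `field_norm` is set to 1/||d||, this reduces to the cosine form derived above; Lucene's actual norm is a different quantity, which is where the approximation comes from.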

||d||, the norm of the document vector, however, has no direct equivalent among the terms of their formula.
