Lucene scoring: a clarification about the vector space model

Views: 51 | Published: 2021/5/30 21:46:16 | Tags: elasticsearch, lucene, similarity

Question

I'm not sure I understand how the vector space model is used in Lucene scoring.

I read here (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) that Lucene scores a document as the sum of the tf-idf of each query term (if we omit the coordination factor, field length and boosts). I don't understand how the vector space model is used.

The vector space model could be used to compute the similarity between the tf-idf vector of a document and the tf-idf vector of the query. This would give a cosine similarity score between the query and the document. The score would be between 0 and 1, so scores from different queries should be easy to compare.

Why doesn't Lucene compute its score this way?

Recommended answer

Lucene uses the 'practical scoring function' mentioned in your link, which is an approximation of cosine similarity, extended to support 'practical' features such as boosts.

If you take the vector space cosine similarity formula for a query q and a document d, you have:

s(q, d) = q * d / (||q|| * ||d||)
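
As a minimal sketch of the formula above, the cosine similarity of two weight vectors can be computed in plain Python. The vectors and their values are hypothetical and only serve as an illustration; they are not taken from Lucene.

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        # Convention chosen for this sketch: an empty vector matches nothing.
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical tf-idf weights over a three-term vocabulary.
query_vec = [1.2, 0.0, 0.7]
doc_vec = [2.4, 0.5, 0.0]
score = cosine_similarity(query_vec, doc_vec)
print(score)
```

With non-negative weights such as tf-idf values, the result always lies between 0 and 1, which is the property the question refers to.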

Considering that q and d are vectors like [tf(t1) * idf(t1), ...], and that in the q vector tf(t) is either 1 or 0, the formula becomes:

s(q, d) = ∑( tf(t in d) * idf(t)² )(t in q) / (||q|| * ||d||)

By defining queryNorm = 1/√sumOfSquaredWeights (that is, 1/||q|| when no boosts are applied), you can rewrite this as:

s(q, d) = queryNorm(q) * ∑( tf(t in d) * idf(t)² )(t in q) / ||d||
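
The rewriting above can be checked numerically. The following sketch assumes a query without boosts, a made-up three-term vocabulary, and hypothetical idf and term-frequency values. It computes the cosine similarity directly and then through the queryNorm form.

```python
import math

# Hypothetical idf values and term frequencies, for illustration only.
idf = {"lucene": 1.5, "scoring": 2.0, "vector": 1.2}
query_terms = ["lucene", "scoring"]
doc_tf = {"lucene": 3, "vector": 1}

vocab = sorted(idf)
# In the query vector, tf(t) is 1 for query terms and 0 otherwise.
q_vec = [idf[t] if t in query_terms else 0.0 for t in vocab]
d_vec = [doc_tf.get(t, 0) * idf[t] for t in vocab]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

# Direct cosine similarity: q . d / (||q|| * ||d||)
cosine = sum(a * b for a, b in zip(q_vec, d_vec)) / (norm(q_vec) * norm(d_vec))

# Rewritten form: queryNorm(q) * sum(tf(t in d) * idf(t)^2) / ||d||
sum_of_squared_weights = sum(idf[t] ** 2 for t in query_terms)
query_norm = 1 / math.sqrt(sum_of_squared_weights)
rewritten = (
    query_norm
    * sum(doc_tf.get(t, 0) * idf[t] ** 2 for t in query_terms)
    / norm(d_vec)
)

print(cosine, rewritten)
```

Both values agree up to floating-point rounding, because sumOfSquaredWeights is exactly ||q||² under these assumptions.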

This is close to the formula given in the docs:

score(q, d) = queryNorm(q) * coord(q,d) * 
              ∑ ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d)) (t in q)
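
To make the formula above concrete, here is a rough sketch in Python. It assumes the factor definitions that the linked guide gives for the classic TF-IDF similarity (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), and norm as 1/√numTerms). The function names, the token list and the document frequencies are hypothetical, and details such as Lucene's lossy norm encoding are ignored.

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def practical_score(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Sketch of the practical scoring function for one document.

    Assumes query_terms is a non-empty list of distinct terms.
    """
    boosts = boosts or {}
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in query_terms}
    sum_sq = sum((idfs[t] * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)
    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)
    norm = field_norm(len(doc_tokens))
    total = sum(
        tf(doc_tokens.count(t)) * idfs[t] ** 2 * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return query_norm * coord * total

# Hypothetical index statistics and document.
doc_tokens = ["lucene", "scoring", "with", "lucene"]
doc_freqs = {"lucene": 4, "scoring": 2, "vector": 1}
full_match = practical_score(["lucene", "scoring"], doc_tokens, doc_freqs, 10)
partial_match = practical_score(["lucene", "vector"], doc_tokens, doc_freqs, 10)
print(full_match, partial_match)
```

Unlike the cosine similarity, this score is not divided by ||d||, so it is not guaranteed to fall between 0 and 1.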

However, ||d||, the norm of the document vector, has no direct equivalent among the terms of their formula.
