Lucene评分,向量空间模型的精度 [英] Lucene scoring, precision about vector space model
问题描述
我不确定如何在Lucene评分中使用向量空间模型.
I'm not sure to understand how vector space model is used in lucene scoring.
我在这里阅读了( https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html ),lucene会将文档的得分作为每个术语查询的tf-idf的总和(如果我们省略协调因子,场长和增强).我不明白如何使用向量空间模型.
I read here (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) that lucene scores a document as the sum of the tf-idf of each term query (if we omit coordination factor, field length and boosts). I don't understand how vector space model is used.
空间矢量模型可用于计算文档的tf-idf矢量和查询的tf-idf矢量之间的相似度.这应该给我们查询和文档之间的CosSimilarity分数.分数在0到1之间,因此不同的请求应该易于比较.
Space vector model could be used to calculate the similarity between the tf-idf vector of a document and the tf-idf vector of the query. This should give us a CosSimilarity score between the query and a document. The score would be between 0 and 1, so different requests should be easy to compare.
为什么不使用lucene score?
Why not using lucene score ?
推荐答案
Lucene使用链接中提到的实用分数函数",它近似于余弦相似度-扩展为支持实用"功能,例如增强.
Lucene uses the 'practical score function' mentioned in your link, which is an approximation of the cosine similarity - extended to support 'practical' features such as boosts.
如果对查询q和文档d采用向量空间余弦相似度公式,则您将:
If you take the vector space cosine similarity formula for a query q and a document d, you have:
s(q, d) = q * d / (||q|| * ||d||)
考虑到q和d是像 [tf(t1)* idf(t1),...]
这样的矢量,而在q矢量tf(t)中则为1或0,公式变为:
Considering that q and d are vectors like [tf(t1) * idf(t1), ...]
, and that in the q vector tf(t) is either 1 or 0, the formula becomes:
s(q, d) = ∑( tf(t in d) * idf(t)² )(t in q) / (||q|| * ||d||)
您可以根据定义 queryNorm = 1/√sumOfSquaredWeights
s(q, d) = queryNorm(q) * ∑( tf(t in d) * idf(t)² )(t in q) / ||d||
这与他们在文档中给出的公式很接近:
which is close to the formula they give in the docs:
score(q, d) = queryNorm(q) * coord(q,d) *
∑ ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d)) (t in q)
||| d ||
(文档向量的范数)在其公式方面没有直接等价的内容.
||d||
, the norm of the document vector, however, does not have a direct equivalent in the terms of their formula.
这篇关于Lucene评分,向量空间模型的精度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!