Scoring of Solr multivalued field
Problem description
If I have a document with a multivalued field in Solr, are the multiple values scored independently, or are they just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:
I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases), but they all refer to the same person/document.
Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff
If I were to search for "David", I'd like all of these to have about the same chance of a match. If each name is scored independently, that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?
Recommended answer
You can just run your query q=field_name:David with debugQuery=on and see what happens.
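A request like the one described can be assembled as in the following minimal sketch. The host, port, and core name (`localhost:8983`, `collection1`) are assumptions for illustration only, and the field name `text_ws` is taken from the output shown in this answer; adjust both to your own setup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr location; replace with your own host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"

def build_debug_query_url(field, term):
    """Build a select URL that returns scores and the scoring explanation."""
    params = {
        "q": f"{field}:{term}",   # the query from the answer
        "fl": "*,score",          # include the score with each document
        "sort": "score desc",     # highest score first
        "debugQuery": "on",       # adds the per-document "explain" section
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_debug_query_url("text_ws", "David")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the documents together with the explain section.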
These are the results (with the score included through fl=*,score), sorted by score desc:
<doc>
<float name="score">0.4451987</float>
<str name="id">2</str>
<arr name="text_ws">
<str>David Letterman</str>
</arr>
</doc>
<doc>
<float name="score">0.44072422</float>
<str name="id">3</str>
<arr name="text_ws">
<str>David Hasselhoff</str>
<str>David Michael Hasselhoff</str>
</arr>
</doc>
<doc>
<float name="score">0.314803</float>
<str name="id">1</str>
<arr name="text_ws">
<str>David Bowie</str>
<str>David Robert Jones</str>
<str>Ziggy Stardust</str>
<str>Thin White Duke</str>
</arr>
</doc>
And this is the explanation:
<lst name="explain">
<str name="2">
0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
</str>
<str name="3">
0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
</str>
<str name="1">
0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
</str>
</lst>
The scoring factors here are:
- termFreq: how often the term appears in the document (the tf factor is its square root, hence 1.4142135 for two occurrences)
- idf: inverse document frequency, which depends on how many documents in the index contain the term; rarer terms get a higher value
- fieldNorm: importance of the term, depending on index-time boosting and field length; shorter fields get a higher value
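The way these factors combine can be checked by hand. The sketch below reproduces the three scores from the explain output, assuming the classic TF-IDF formulas (tf as the square root of termFreq, idf as 1 + ln(maxDocs / (docFreq + 1))); the function names are hypothetical and the fieldNorm values are copied from the output rather than computed.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Inverse document frequency under the assumed classic formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    """Product of tf, idf and fieldNorm, as listed in the explain output."""
    tf = math.sqrt(term_freq)
    return tf * classic_idf(doc_freq, max_docs) * field_norm

# termFreq and fieldNorm per document id, taken from the explain output.
scores = {
    "2": field_weight(1, 3, 3, 0.625),
    "3": field_weight(2, 3, 3, 0.4375),
    "1": field_weight(2, 3, 3, 0.3125),
}

for doc_id, score in scores.items():
    print(doc_id, round(score, 6))
```

The computed values agree with the reported scores to within floating-point rounding, which suggests the only per-document differences are termFreq and fieldNorm.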
In your example the fieldNorm makes the difference. The top document has a lower tf (1 instead of 1.4142135), since the term appears just once, but that match counts for more because the field is shorter.
The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single-valued field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
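The fieldNorm values in the explain output are consistent with all values of the field being counted as one token stream. The sketch below is an approximation under stated assumptions, not Lucene's actual code: it assumes no index-time boost, a length norm of 1 / sqrt(number of tokens), and a lossy one-byte encoding that effectively keeps two fractional mantissa bits. The function name is hypothetical.

```python
import math

def approx_field_norm(num_tokens):
    """Approximate the stored fieldNorm for a field with num_tokens tokens."""
    raw = 1.0 / math.sqrt(num_tokens)        # length norm before encoding
    mantissa, exponent = math.frexp(raw)     # raw == mantissa * 2**exponent, mantissa in [0.5, 1)
    mantissa *= 2.0                          # rescale mantissa to [1, 2)
    exponent -= 1
    truncated = math.floor(mantissa * 4.0) / 4.0   # drop everything past two fractional bits
    return math.ldexp(truncated, exponent)

# Total whitespace-separated tokens across all values of each document.
token_counts = {
    "1": 10,  # David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
    "2": 2,   # David Letterman
    "3": 5,   # David Hasselhoff, David Michael Hasselhoff
}

for doc_id, count in sorted(token_counts.items()):
    print(doc_id, count, approx_field_norm(count))
```

Under these assumptions the ten tokens of document 1 yield the lowest norm, which matches the ordering seen in the results.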
UPDATE
I actually think David Bowie deserves his opportunity. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in schema.xml and reindex. The same query will then give you the following result:
<doc>
<float name="score">1.0073696</float>
<str name="id">1</str>
<arr name="text">
<str>David Bowie</str>
<str>David Robert Jones</str>
<str>Ziggy Stardust</str>
<str>Thin White Duke</str>
</arr>
</doc>
<doc>
<float name="score">1.0073696</float>
<str name="id">3</str>
<arr name="text">
<str>David Hasselhoff</str>
<str>David Michael Hasselhoff</str>
</arr>
</doc>
<doc>
<float name="score">0.71231794</float>
<str name="id">2</str>
<arr name="text">
<str>David Letterman</str>
</arr>
</doc>
As you can now see, the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of David are on top with the same score, despite their different lengths, and the shorter document with just one match comes last with the lowest score. Here's the explanation with debugQuery=on:
<lst name="explain">
<str name="1">
1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
</str>
<str name="3">
1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
</str>
<str name="2">
0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
</str>
</lst>
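The effect of dropping the norms on the ranking can be replayed with the numbers from both explain outputs. This is a minimal sketch assuming the classic tf * idf * fieldNorm product, with fieldNorm fixed at 1.0 when norms are omitted; the helper names are hypothetical.

```python
import math

# idf for docFreq=3, maxDocs=3 under the assumed classic formula.
IDF = 1.0 + math.log(3 / (3 + 1))

# termFreq and fieldNorm per document id, taken from the explain outputs.
docs = {
    "1": {"term_freq": 2, "field_norm": 0.3125},
    "2": {"term_freq": 1, "field_norm": 0.625},
    "3": {"term_freq": 2, "field_norm": 0.4375},
}

def score(doc, use_norms):
    """Score one document, treating fieldNorm as 1.0 when norms are omitted."""
    norm = doc["field_norm"] if use_norms else 1.0
    return math.sqrt(doc["term_freq"]) * IDF * norm

def rank(use_norms):
    """Return document ids ordered by descending score."""
    return sorted(docs, key=lambda d: score(docs[d], use_norms), reverse=True)

print("with norms:   ", rank(use_norms=True))
print("without norms:", rank(use_norms=False))
```

With norms, the single-token-match document 2 leads; without them, documents 1 and 3 tie on top and document 2 drops to last place.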