Scoring of a Solr multivalued field

Question

If I have a document with a multivalued field in Solr, are the multiple values scored independently, or are they concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases), but they all refer to the same person/document.

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2: David Letterman

Person 3: David Hasselhoff, David Michael Hasselhoff

If I search for "David", I'd like all of these to have about the same chance of matching. If each name is scored independently, that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

Recommended answer

You can just run your query q=field_name:David with debugQuery=on and see what happens.
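As a rough sketch of how such a request could be assembled, the snippet below builds the select URL with the parameters mentioned in this answer (q, fl, sort, debugQuery). The host, port, and core name in SOLR_SELECT_URL are assumptions for illustration and are not taken from the question; replace them with your own.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your own Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_debug_query_url(field: str, term: str) -> str:
    """Return a select URL that asks Solr for scores and the explain output."""
    params = {
        "q": f"{field}:{term}",   # the field query, e.g. text_ws:David
        "fl": "*,score",          # return all stored fields plus the score
        "sort": "score desc",     # highest score first
        "debugQuery": "on",       # include the explain section in the response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_debug_query_url("text_ws", "David")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the documents together with the explain section discussed next.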

These are the results, sorted by score desc (the score is included through fl=*,score):

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

Here is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

  • termFreq: how often the term appears in the document (the tf value in the explain output is its square root)
  • idf: inverse document frequency; the more documents in the index contain the term, the lower the value
  • fieldNorm: importance of the match, depending on index-time boosting and field length

In your example the fieldNorm makes the difference. One document has a lower tf (1.0 instead of 1.4142135) because the term appears only once, but that match counts for more because the field is shorter.
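To make the arithmetic concrete, here is a small sketch that multiplies the three factors exactly as they appear in the explain output above. The numbers are copied from that output; the function and variable names are only illustrative.

```python
# (tf, idf, fieldNorm) copied from the explain output above, keyed by document id.
factors = {
    "2": (1.0, 0.71231794, 0.625),
    "3": (1.4142135, 0.71231794, 0.4375),
    "1": (1.4142135, 0.71231794, 0.3125),
}


def field_weight(tf: float, idf: float, field_norm: float) -> float:
    """fieldWeight as shown by explain: the product of tf, idf and fieldNorm."""
    return tf * idf * field_norm


scores = {doc_id: field_weight(*f) for doc_id, f in factors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 7))
```

Document 2 ends up first even though it has the lowest tf, because its fieldNorm (0.625) is the largest of the three.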

The fact that your field is multiValued doesn't change the scoring. I guess the result would be the same for a single-valued field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
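The effect of field length can be sketched as follows, under the assumption that the classic Lucene TF-IDF similarity is in use, where the length norm is 1/sqrt(number of tokens in the field) before it is stored in a lossy single-byte encoding. That encoding would explain why the explain output shows slightly lower values (0.625, 0.4375, 0.3125) than the raw numbers computed here. Whitespace splitting stands in for the text_ws tokenizer.

```python
import math

# All values of the multivalued field, keyed by document id.
names = {
    "1": ["David Bowie", "David Robert Jones", "Ziggy Stardust", "Thin White Duke"],
    "2": ["David Letterman"],
    "3": ["David Hasselhoff", "David Michael Hasselhoff"],
}


def token_count(values: list[str]) -> int:
    """Count tokens over all values, as if they formed one long field."""
    return sum(len(value.split()) for value in values)


def length_norm(num_tokens: int) -> float:
    """Assumed classic length norm, before the lossy byte encoding."""
    return 1.0 / math.sqrt(num_tokens)


norms = {doc_id: length_norm(token_count(values)) for doc_id, values in names.items()}

for doc_id, values in names.items():
    print(doc_id, token_count(values), round(norms[doc_id], 4))
```

The token count is taken over every value of the field, so the document with the most aliases gets the smallest norm regardless of how the tokens are distributed among the values.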

UPDATE
I actually think David Bowie deserves his chance. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms="true" to your text_ws field in schema.xml and reindex. The same query will then give you the following result:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see, the termFreq now wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of David are on top with the same score, despite their different lengths, and the shorter document with just one match comes last with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
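The values in this explain output can be reproduced from the raw counts alone. The sketch below assumes the classic Lucene TF-IDF formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)). These formulas are not stated in the answer itself, but they agree with the numbers shown. With norms omitted, fieldNorm is fixed at 1.0.

```python
import math


def tf(term_freq: int) -> float:
    """Assumed classic tf: square root of the raw term frequency."""
    return math.sqrt(term_freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Assumed classic idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Occurrences of "David" per document id, as reported by termFreq in explain.
term_freqs = {"1": 2, "3": 2, "2": 1}

FIELD_NORM = 1.0  # omitNorms=true: field length no longer influences the score

scores = {
    doc_id: tf(freq) * idf(doc_freq=3, max_docs=3) * FIELD_NORM
    for doc_id, freq in term_freqs.items()
}

for doc_id, score in scores.items():
    print(doc_id, round(score, 7))
```

Documents 1 and 3 tie because they share the same termFreq, and nothing else in the formula distinguishes them once norms are omitted.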
