Scoring of Solr multivalued field

Question

If I have a document with a multivalued field in Solr, are the multiple values scored independently, or are they just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2: David Letterman

Person 3: David Hasselhoff, David Michael Hasselhoff

If I were to search for "David", I'd like all of these to have about the same chance of a match. If each name is scored independently, that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

Recommended answer

You can just run your query q=field_name:David with debugQuery=on and see what happens.
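
As a rough sketch, the request could be assembled like this. The host, port, and core name in SOLR_SELECT are assumptions for illustration, and the field name text_ws is taken from the results further down; only the q, fl, sort, and debugQuery parameters come from the answer itself.

```python
from urllib.parse import urlencode

# Hypothetical Solr location: adjust host, port, and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def build_debug_query_url(field, term):
    """Build a select URL that returns scores and the scoring explanation."""
    params = {
        "q": f"{field}:{term}",   # the query from the answer
        "fl": "*,score",          # return all stored fields plus the score
        "sort": "score desc",     # best match first
        "debugQuery": "on",       # include the per-document explain section
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_debug_query_url("text_ws", "David")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) against a running Solr instance should show the explain section alongside the documents.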

These are the results, sorted by score desc (the score is included through fl=*,score):

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

And this is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of:
          1.0 = tf(termFreq(text_ws:David)=1)
          0.71231794 = idf(docFreq=3, maxDocs=3)
          0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of:
          1.4142135 = tf(termFreq(text_ws:David)=2)
          0.71231794 = idf(docFreq=3, maxDocs=3)
          0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of:
          1.4142135 = tf(termFreq(text_ws:David)=2)
          0.71231794 = idf(docFreq=3, maxDocs=3)
          0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

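  • termFreq: how many times the term occurs in the field of this document; the tf factor is derived from it (termFreq=2 gives tf=1.4142135 above)
  • idf: inverse document frequency; the more documents in the index contain the term, the lower the value (here it is the same for all three documents)
  • fieldNorm: a normalization factor that depends on index-time boosting and field length; longer fields get a lower value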
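
To make the product concrete, here is a small sketch that recomputes the three scores from the explain output. The tf and idf formulas are assumptions based on Lucene's classic TF-IDF similarity (square root of termFreq, and 1 + ln(maxDocs / (docFreq + 1))); they do reproduce the tf and idf values shown above. The fieldNorm values are copied from the output rather than computed.

```python
import math


def tf(term_freq):
    # Assumed classic TF-IDF behaviour: square root of the raw frequency.
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    # Assumed classic TF-IDF behaviour: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is the product of the three factors in the explain output.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


# (termFreq, fieldNorm) per document id, copied from the explain output.
docs = {"2": (1, 0.625), "3": (2, 0.4375), "1": (2, 0.3125)}

scores = {
    doc_id: field_weight(term_freq, 3, 3, norm)
    for doc_id, (term_freq, norm) in docs.items()
}

for doc_id, score in scores.items():
    print(doc_id, round(score, 7))
```

The printed values should agree with the scores in the response up to floating-point rounding, since Solr reports single-precision floats.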

In your example the fieldNorm makes the difference. Document 2 has a lower tf (1 instead of 1.4142135) because the term appears only once, but that match counts for more because the field is shorter.

The fact that your field is multiValued doesn't change the scoring. As far as I can tell, it would be the same with a single-valued field holding the same content. Solr works in terms of field length and terms, so yes, David Bowie is punished for having many more tokens than the others. :)
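
The length penalty can be sketched as follows, assuming whitespace tokenization, no index-time boost, and the classic length norm of 1 / sqrt(number of tokens) computed over all values of the field together. The raw values will not match the fieldNorm figures in the explain output exactly; as far as I know that is because the norm is stored in a lossy one-byte encoding, so treat the numbers below as approximations.

```python
import math

# The names from the question, keyed by document id.
names = {
    "1": ["David Bowie", "David Robert Jones", "Ziggy Stardust", "Thin White Duke"],
    "2": ["David Letterman"],
    "3": ["David Hasselhoff", "David Michael Hasselhoff"],
}


def token_count(values):
    # All values of the multivalued field count towards one field length.
    return sum(len(value.split()) for value in values)


def raw_length_norm(values):
    # Assumed classic length norm, before any lossy encoding is applied.
    return 1.0 / math.sqrt(token_count(values))


for doc_id, values in names.items():
    print(doc_id, token_count(values), round(raw_length_norm(values), 4))
```

Document 1 has the most tokens across its values and therefore the smallest norm, which is the same ordering the explain output shows.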

UPDATE
I actually think David Bowie deserves his chance. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms="true" to your text_ws field in schema.xml and reindex. The same query will then give you the following result:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see, the termFreq now wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of David come first with the same score, despite their different lengths, and the shorter document with just one match comes last with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of:
        1.4142135 = tf(termFreq(text:David)=2)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of:
        1.4142135 = tf(termFreq(text:David)=2)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of:
        1.0 = tf(termFreq(text:David)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
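
A minimal sketch of why the order changes once norms are omitted: with fieldNorm fixed at 1.0 and the same idf for every document, only the tf factor can separate the documents. The idf constant is copied from the explain output, and tf is again assumed to be the square root of termFreq.

```python
import math

IDF = 0.71231794  # copied from the explain output; identical for all documents

# Occurrences of "David" per document id, as reported by termFreq.
term_freqs = {"1": 2, "2": 1, "3": 2}


def score_without_norms(term_freq):
    # fieldNorm is 1.0 when norms are omitted, so it drops out of the product.
    return math.sqrt(term_freq) * IDF * 1.0


no_norm_scores = {doc_id: score_without_norms(f) for doc_id, f in term_freqs.items()}

# Sort best first; documents with equal termFreq tie exactly.
ranking = sorted(no_norm_scores, key=lambda doc_id: -no_norm_scores[doc_id])
print(ranking)
```

Documents 1 and 3 tie because field length no longer matters, and document 2 drops to the bottom with its single occurrence.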
