Solr: Scores As Percentages


Question

First of all, I have already seen the Lucene doc which tells us not to produce scores as percentages:

People frequently want to compute a "Percentage" from Lucene scores to determine what is a "100% perfect" match vs a "50%" match. This is also sometimes called a "normalized score".

Don't do this.

Seriously. Stop trying to think about your problem this way, it's not going to end well.

Because of these recommendations, I used another way to solve my problem.

However, there are a few points in Lucene's argument where I don't really understand why they are problematic in some cases.

For the case of this post, I can easily understand why it is bad: if a user does a search and sees the following results:

  • ProductA: 5 stars
  • ProductB: 2 stars
  • ProductC: 1 star

If ProductA is deleted after his first search, then the next time the user comes back he will be surprised to see the following results:

  • ProductB: 5 stars
  • ProductC: 3 stars

So this problem is exactly the one the Lucene documentation points out.

Now, let's take another example.

Imagine we have an e-commerce website which uses 'classic search' combined with phonetic search. The phonetic search is there to avoid as many empty result sets caused by spelling mistakes as possible. The scores of phonetic results are very low relative to the scores of classic search.
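As a concrete illustration of that setup (my own sketch, not part of the original question), the combination could be a single query over the classic field plus a hypothetical name_phonetic copy of it, analyzed with a phonetic filter such as Solr's DoubleMetaphoneFilterFactory:

    name:(<user input>)^4 OR name_phonetic:(<user input>)

Exact and near-exact matches on name get the boost, while documents that only match the phonetic field end up with much lower scores, which is the kind of gap described above.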

In this case, the first idea was to only return results which have at least 10% of the maximum score. Results under this threshold would not be considered relevant for us, even with classic search.
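For concreteness, here is a minimal sketch of what that client-side cutoff could look like with SolrJ; the collection name, field names and query string are placeholders of my own, and the snippet only illustrates the idea being discussed rather than recommending it:

    import java.util.List;
    import java.util.stream.Collectors;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class ScoreCutoffSketch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                // Hypothetical query mixing the classic and phonetic fields.
                SolrQuery query = new SolrQuery("name:(smithfield) OR name_phonetic:(smithfield)");
                query.setFields("id", "name", "score"); // ask Solr to return the score pseudo-field
                QueryResponse rsp = client.query("products", query);

                SolrDocumentList docs = rsp.getResults();
                // Keep only hits scoring at least 10% of the best hit.
                float threshold = docs.getMaxScore() * 0.10f;
                List<SolrDocument> kept = docs.stream()
                        .filter(d -> ((Number) d.getFieldValue("score")).floatValue() >= threshold)
                        .collect(Collectors.toList());

                kept.forEach(d -> System.out.println(d.getFieldValue("name") + " " + d.getFieldValue("score")));
            }
        }
    }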

If I do that, I don't have the problem from the post above: if a document is deleted, it seems logical that the old second product becomes the first one, and the user will not be very surprised (it is the same behavior as if I had kept the score as a float value).

Furthermore, if the scores of the phonetic search are very low, as we expect, we keep the same behavior and only return relevant results.

So my questions are: is it always bad to normalize scores as Lucene advises? Is my example an exception, or is it a bad idea to do this even in my example?

Answer

The problem is, how do you determine your cutoff, and what does it mean?

It might be easier to look at an example. Say I'm trying to look for people by last name. I'm going to search for:

  • smithfield

And I have the following documents that I think are all pretty good matches:

  • smithfield - an exact match
  • smithfielde - very close, a sound-alike, only one (silent) letter off
  • smythfield - very close, a sound-alike, a single vowel changed
  • smithfelt - a few letters off, but still close and a sound-alike
  • snithfield - not really a sound-alike, but only one letter off. Probably a typo.
  • smittfield - again, not much of a sound-alike, probably a typo or misspelling
  • smythfelt - not a great spelling, but might be a mistake
  • smithfieldings - same prefix

So, I've got four things I need to match. The exact match should be guaranteed the highest score, and we want prefix, fuzzy and sound-alike matches. So let's search for:

smithfield smithfield* smithfield~2 metaphone:sm0flt

Results:

  • smithfield ::: 2.3430576
  • smithfielde ::: 0.97367656
  • smythfield ::: 0.5657166
  • smithfelt ::: 0.50767094

< 10% - not shown:

  • snithfield ::: 0.2137136
  • smittfield ::: 0.2137136
  • smythfelt ::: 0.0691447
  • smithfieldings ::: 0.041700535

I thought smithfieldings was a pretty good match, but it's nowhere even close to making the cut! It's less than 2% of the maximum, never mind 10%! Okay, so let's try boosting:

smithfield^4 smithfield*^2 smithfield~2 metaphone:sm0flt

Results:

  • smithfield ::: 2.8812196
  • smithfielde ::: 0.5907072
  • smythfield ::: 0.30413133

< 10% - not shown:

  • smithfelt ::: 0.2729258
  • snithfield ::: 0.11489322
  • smittfield ::: 0.11489322
  • smithfieldings ::: 0.044836726
  • smythfelt ::: 0.037172448

That's even worse!

And in production the problem will be worse still. In the real world, you may be dealing with long, complex queries and full-text documents. Field length, repetitions of matches, coordination factors, boosts and numerous query terms all factor into the score.

It's really not all that unusual to see the first result score an order of magnitude higher than the second, even though the second is still a meaningful, interesting result. There isn't any guarantee of an even distribution of scores, so we don't know what the 10% figure means. And Lucene's scoring algorithm tends to err on the side of making the differences nice and big.
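If you want to see exactly which of those factors produced a given score, Solr can print the per-document scoring explanation; for example, adding debugQuery=true to the request for the query above:

    /select?q=smithfield smithfield* smithfield~2 metaphone:sm0flt&fl=*,score&debugQuery=true

The explain section of the debug output breaks every score down into its contributing factors (term frequencies, norms, boosts and so on), which makes it easy to see how uneven the distribution really is.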

Is it always bad? I'd say yes. As I see it, there are always two better options.

1 - Control your result set with good queries. If you construct your query well, then that will provide the cutoff of your results, not because of some arbitrary cutoff in score, but because irrelevant documents won't be scored at all.
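As one possible sketch of that idea (my wording, not necessarily the only way to do it), the last-name query above could be written so that the sound-alike clause never matches on its own and only influences ranking, letting the query itself act as the cutoff:

    +(smithfield smithfield* smithfield~2) metaphone:sm0flt

With the default OR operator, a document must match the required exact/prefix/fuzzy group to be returned at all; the metaphone clause only adds to the score of documents that already qualify, so there is no arbitrary score threshold to choose.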

2 - If you don't want to do that, do you really gain anything by cutting off results at that arbitrary point? Users are pretty good at recognizing when search results have gone off the deep end. A user not being able to find what they want is a serious annoyance, while showing too many results is usually a non-issue as long as they are ordered well.

