ElasticSearch default scoring mechanism

Question

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. Does it use Lucene's scoring, or does it use scoring of its own?

For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Consider this type of query:

IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
    s.From(0)
     .Size(300)
     .Explain()
     .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);

which translates to the following JSON query:

{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}

The search is performed on about 1.1 million documents. What I get in return is the following (only part of the result, formatted by me):

650   "ExampleName"              7,313398
651   "ExampleName"              7,313398
652   "ExampleName"              7,313398
653   "ExampleName"              7,239194
654   "ExampleName"              7,239194
860   "ExampleName of Something" 4,5708737

where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.

As you can see, there are many duplicates in the ES index. Since some of the documents found have different scores even though they are exactly the same (only the Id differs), I concluded that different shards searched different parts of the whole dataset. That leads me to suspect that the score depends partly on the overall data in a given shard, not exclusively on the document the search engine is actually considering.
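The suspicion above can be illustrated with a small sketch. It assumes Lucene's classic TF-IDF defaults (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(number of terms)) and the simplification that, for a single-term query, the query normalization cancels one idf factor. It also ignores the precision Lucene loses when it stores norms compactly. The shard sizes and document frequencies are hypothetical, not taken from the index in question.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF-IDF idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def single_term_score(term_freq: int, doc_freq: int, num_docs: int, field_length: int) -> float:
    """Approximate score of a one-term query against one field.

    Simplification (an assumption of this sketch): with a single query term,
    queryNorm is 1/idf, so the score reduces to tf * idf * lengthNorm.
    """
    tf = math.sqrt(term_freq)
    idf = classic_idf(doc_freq, num_docs)
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf * length_norm


# Hypothetical per-shard statistics: same shard size, slightly different
# numbers of documents containing the term.
shard_a = {"num_docs": 220_000, "doc_freq": 400}
shard_b = {"num_docs": 220_000, "doc_freq": 430}

# The same document (term appears once, field holds one term) on each shard.
score_a = single_term_score(1, shard_a["doc_freq"], shard_a["num_docs"], 1)
score_b = single_term_score(1, shard_b["doc_freq"], shard_b["num_docs"], 1)

print(f"shard A: {score_a:.6f}")
print(f"shard B: {score_b:.6f}")
```

Identical documents get slightly different scores here only because the term statistics differ per shard. If that is what is happening, running the search with `search_type=dfs_query_then_fetch`, which gathers term statistics across shards first, should make the scores of the duplicates agree.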

The question is: how exactly does this scoring work? Could you show me, or point me to, the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?

Answer

The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or using something like a custom_score query.
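For reference, the practical scoring function as I recall it from the Lucene Javadoc for TFIDFSimilarity (the class DefaultSimilarity extends in Lucene 4.x) is roughly the following. Treat it as a summary under that assumption and check the exact form against the Lucene version your ElasticSearch release ships with.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)

\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
```

Here numDocs and docFreq are index statistics, while norm(t,d) combines index-time boosts with a length normalization that favors shorter fields.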

The odd score variation in the first five results shown seems small enough that it doesn't concern me much as far as the validity of the query results and their ordering go, but if you want to understand its cause, the explain API can show you exactly what is going on there.
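With "explain": true in the request, each hit carries an explanation tree whose nodes have a value, a description, and nested details. A minimal sketch for printing such a tree as an indented outline follows. The helper name `flatten_explanation` and the sample tree are hypothetical and purely illustrative, not output from the index in question.

```python
def flatten_explanation(node, depth=0):
    """Return the explanation tree as indented 'value = description' lines."""
    lines = [f"{'  ' * depth}{node['value']} = {node['description']}"]
    # Leaf nodes may omit "details" or carry an empty list.
    for child in node.get("details") or []:
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hypothetical explanation tree, shaped like the nodes the explain output uses.
sample_explanation = {
    "value": 2.0,
    "description": "fieldWeight, product of:",
    "details": [
        {"value": 1.0, "description": "tf"},
        {"value": 2.0, "description": "idf"},
        {"value": 1.0, "description": "fieldNorm"},
    ],
}

for line in flatten_explanation(sample_explanation):
    print(line)
```

Comparing the printed idf lines for two duplicate documents with different scores should show whether differing per-shard statistics are the cause.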
