Elasticsearch scoring on multiple indexes


Question

I have one index for each quarter of a year ("index-2015.1", "index-2015.2", ...).

I have around 30 million documents in each index.

Each document has a text field ('title').

My documents are sorted by (1) _score, then (2) created date.

The problem is:

When I search for some text in the 'title' field across all indexes ("index-201*"), the first results always come from a single index.

Say I search for 'title=home', and I have 10k documents with title=home in "index-2015.1" and another 10k documents with title=home in "index-2015.2". The first results are then all documents from "index-2015.1" (not from "index-2015.2", and not a mix of both), even though "index-2015.2" contains documents with a later "created date" than those in "index-2015.1".

Is there a reason for this?

Recommended answer

The reason is probably that scores are specific to the index. So if you really have multiple indices, the score of a matching document will be calculated (slightly) differently in each index.

Simply put, the score of a matching document depends, among other things, on the query terms and how often they occur in the index. The score is calculated relative to the index (actually, by default, relative to each separate shard). Elasticsearch does apply some normalizations, but I don't know the details of those.
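To make the dependence on index statistics concrete, here is a minimal sketch of the inverse document frequency part of classic TF/IDF scoring, using the Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The document counts and frequencies below are made-up numbers for illustration, not measurements from the indexes in the question, and real scoring involves further factors (term frequency, field-length norms, per-shard statistics).

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """Classic Lucene-style inverse document frequency for one term."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical statistics for the term "home" in the 'title' field.
# These numbers are assumptions chosen only to show the effect.
stats = {
    "index-2015.1": {"doc_count": 30_000_000, "doc_freq": 10_000},
    "index-2015.2": {"doc_count": 30_000_000, "doc_freq": 40_000},
}

# IDF computed separately per index: the same term gets a different weight.
local_idf = {
    name: idf(s["doc_count"], s["doc_freq"]) for name, s in stats.items()
}

# IDF computed from the combined statistics of both indexes.
total_docs = sum(s["doc_count"] for s in stats.values())
total_freq = sum(s["doc_freq"] for s in stats.values())
global_idf = idf(total_docs, total_freq)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"combined: global idf = {global_idf:.3f}")
```

With these assumed numbers, the term is rarer in "index-2015.1", so matches there get a higher weight and would sort ahead of equally good matches from "index-2015.2" before the created date is ever considered.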

I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF, which should explain why you get different scores.

https://www.elastic.co/guide/zh-CN/elasticsearch/guide/current/scoring-theory.html

Edit:

So, after testing it a bit on my machine, it seems possible to use another search_type to get scores suitable for your case.

POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
    "query" : {
       "match": {
          "title": "home"
       }
    }
}

The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.
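As a rough illustration of specifying it programmatically, the sketch below builds the same request with Python's standard library, passing search_type as a URL query parameter exactly as in the request above. The host address and the helper names build_search_request and run_search are assumptions made for this example; a dedicated client library would normally expose its own way of setting the search type.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed address of a local Elasticsearch node; adjust for your cluster.
ES_URL = "http://localhost:9200"


def build_search_request(indices, field, text,
                         search_type="dfs_query_then_fetch"):
    """Build a POST request for a match query across several indexes."""
    path = ",".join(indices) + "/_search"
    query_string = urlencode({"search_type": search_type})
    url = f"{ES_URL}/{path}?{query_string}"
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and return the list of hits (needs a running node)."""
    with urlopen(request) as response:
        return json.load(response)["hits"]["hits"]


request = build_search_request(["index1", "index2"], "title", "home")
print(request.get_method(), request.full_url)
```

Calling run_search(request) against a running node would send the query; it is left uncalled here so the sketch runs without a cluster.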

Basically, dfs_query_then_fetch first collects the term and document frequencies from all affected shards (and therefore all affected indexes). The scores are then computed against these combined statistics, so they should be comparable across all of them.
