Correct sorting for exact matches and "beginning with" (prefix) in Elasticsearch


Question

I need to improve the ordering of the result list for searches with Elasticsearch.

Let's say we have 3 documents, each with a single field, with content like this:

  • "apple"
  • "green apple"
  • "apple tree"

If I search for "apple", the results may come back sorted like this:

  • "green apple"
  • "apple tree"
  • "apple"

What I want is for the exact match to have the highest score; here that is the document containing just "apple".

The next highest scores should go to entries beginning with the search word, here "apple tree", with the rest sorted the default way.

So I would like to get:

  • "apple"
  • "apple tree"
  • "green apple"

I have tried to achieve this by using rescore:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple",
                           "boost": 4
                        }
                     }
                  },
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

But this does not really work, because Elasticsearch seems to split all words on whitespace. For example, a search for "apple*" also returns "green apple". That seems to be the reason why rescore is not working for me.

There may also be other characters, such as ".", "-" and ";", that Elasticsearch splits on and that mess up my sorting.

I also experimented with "match_phrase" instead of "bool" in "rescore_query", but without success.

I have also tried it with only a single match clause:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

It seems to work, but I am still not sure. Is this the correct way to do it?

With other queries, the single-match rescore does not work correctly.

Recommended answer

The only place where the score needs to be manipulated is the exact match; otherwise, ordering by the position of terms gives you the correct order. Let's walk through it step by step.

Let's first create a mapping as below:

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field1": {
          "type": "text",
          "analyzer": "whitespace",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

I have created the field my_field1 with the whitespace analyzer to make sure tokens are created using whitespace as the only delimiter. Secondly, I have created a sub-field named keyword, of type keyword. The keyword sub-field holds the non-analyzed value of the input string, and we'll use it for the exact match.
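To make the difference between the two fields concrete, here is a small local sketch in plain Python. It only approximates the indexing behaviour (it does not call Elasticsearch), and the helper names `whitespace_tokens` and `keyword_value` are made up for this illustration. It uses the three sample strings from the question.

```python
# Rough local approximation of how the two fields index each document.
# This is plain Python, not Elasticsearch's actual analysis chain.

docs = {1: "apple", 2: "apple tree", 3: "green apple"}


def whitespace_tokens(text):
    """Approximate the whitespace analyzer: split on whitespace only."""
    return text.split()


def keyword_value(text):
    """Approximate a keyword field: the whole string is one single term."""
    return [text]


for doc_id, text in docs.items():
    print(doc_id, whitespace_tokens(text), keyword_value(text))

# Documents whose analyzed field contains the term "apple".
analyzed_matches = [d for d, t in docs.items() if "apple" in whitespace_tokens(t)]

# Documents whose keyword sub-field is exactly "apple".
exact_matches = [d for d, t in docs.items() if "apple" in keyword_value(t)]

print("analyzed:", analyzed_matches)
print("exact:", exact_matches)
```

In this approximation every document contains the term apple in the analyzed field, while only document 1 has apple as its complete keyword value, which is why the keyword sub-field is the useful one for an exact-match boost.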

Let's add a few docs to the index:

PUT test/_doc/1
{
  "my_field1": "apple"
}

PUT test/_doc/2
{
  "my_field1": "apple tree"
}

PUT test/_doc/3
{
  "my_field1": "green apple"
}

If you use the query below to search for the term apple, the order of the docs will be 2, 1, 3.

POST test/_doc/_search
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "apple",
      "fields": [
        "my_field1"
      ]
    }
  }
}

Setting "explain": true in the above query includes the score calculation steps in the output. Reading them will give you insight into how each document is scored.

All we need to do is boost the score of the exact match. We'll run the exact match against the field my_field1.keyword. You might ask why not my_field1 itself. The reason is that my_field1 is analyzed: when tokens are generated for the input strings of the 3 docs, every doc ends up with the token (term) apple stored against this field (along with other terms where present, e.g. tree for doc 2 and green for doc 3). If we ran the exact match on this field for the term apple, all docs would match, each score would be affected in a similar way, and the order would not change. Since only one document has the exact value apple in my_field1.keyword, only that document (doc 1) matches the exact query, and that is the one we boost. So the query will be:

{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "apple",
            "fields": [
              "my_field1"
            ]
          }
        },
        {
          "query_string": {
            "query": "\"apple\"",
            "fields": [
              "my_field1.keyword^2"
            ]
          }
        }
      ]
    }
  }
}
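If the request body is assembled in application code, the quoted phrase is easy to get wrong by hand. The following is a minimal sketch, assuming Python and a hypothetical helper named `exact_boost_query`, that builds the same bool query as a dictionary and lets `json.dumps` handle the escaping of the inner quotes. It does not escape quotes or other reserved query_string characters inside the search term itself, so it is only suitable for simple terms like the one used here.

```python
import json


def exact_boost_query(term, field="my_field1", boost=2):
    """Build the bool/should body: an analyzed query on `field` plus a
    boosted quoted-phrase query on the `<field>.keyword` sub-field."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Regular analyzed query, matches every doc containing the term.
                    {"query_string": {"query": term, "fields": [field]}},
                    {
                        "query_string": {
                            # Wrap the term in double quotes so query_string
                            # treats it as a single phrase.
                            "query": '"{}"'.format(term),
                            # "^<boost>" is the per-field boost syntax.
                            "fields": ["{}.keyword^{}".format(field, boost)],
                        }
                    },
                ]
            }
        }
    }


body = exact_boost_query("apple")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the search request; the default boost of 2 simply mirrors the value used in the answer and may need tuning for other data.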

Output of the bool query above:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.7260925,
    "hits": [
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.7260925,
        "_source": {
          "my_field1": "apple"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "my_field1": "apple tree"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "green apple"
        }
      }
    ]
  }
}
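The answer boosts only the exact match and relies on default scoring for the rest. If the "beginning with" tier from the question also needs an explicit boost, one possible extension (not part of the original answer) is to add a prefix clause on the keyword sub-field next to a term clause. The sketch below is an assumption-laden illustration: the helper names `tiered_query` and `matched_should_clauses`, the use of a match query as the base clause, and the boost values 4 and 2 are all chosen for this example. The local check only shows which should clauses each sample value would satisfy; it does not reproduce Elasticsearch scoring.

```python
def tiered_query(term, field="my_field1", exact_boost=4, prefix_boost=2):
    """Build a bool query: base match on the analyzed field, plus boosted
    term (exact) and prefix (beginning with) clauses on <field>.keyword."""
    keyword_field = field + ".keyword"
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}}],
                "should": [
                    {"term": {keyword_field: {"value": term, "boost": exact_boost}}},
                    {"prefix": {keyword_field: {"value": term, "boost": prefix_boost}}},
                ],
            }
        }
    }


def matched_should_clauses(term, value):
    """Local check of which should clauses a raw keyword value would satisfy."""
    clauses = []
    if value == term:
        clauses.append("term")
    if value.startswith(term):
        clauses.append("prefix")
    return clauses


docs = {1: "apple", 2: "apple tree", 3: "green apple"}
matches = {
    doc_id: matched_should_clauses("apple", text) for doc_id, text in docs.items()
}
print(tiered_query("apple"))
print(matches)
```

Document 1 satisfies both boosted clauses, document 2 only the prefix clause, and document 3 neither, which is the intended tiering. Whether the final scores actually end up in that order depends on the boosts being large enough relative to the base scores, so it is worth confirming with "explain": true on real data. Note also that keyword fields are case-sensitive unless a normalizer is configured.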
