Elasticsearch: how to match documents for which the field tokens are a subset of the query tokens

Problem description

I have a keyword/key-phrase field that I tokenize with the standard analyzer. I want this field to match whenever the search phrase contains all of the field's tokens.

For example, if the field value is "veni, vidi, vici" and the search phrase is "Ceaser veni,vidi,vici", I want the search phrase to match, but the search phrase "veni, vidi" should not match.

I also need "vidi, veni, vici" (weird!) to match, so the positions and ordering of the terms are not really important. I don't think a phrase match would work for me.

I can use a "bool" query with the "minimum_should_match" parameter for this specific example, but that is not really what I want, because minimum_should_match is about the ratio or number of tokens in the search phrase.

Recommended answer

A pure ES solution would go like this. You will need two requests.

1) First, pass the user query through the analyze API to get all the search tokens.

curl -XGET 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer" : "standard",
  "text" : "Ceaser veni,vidi,vici"
}'

You will get 4 tokens: ceaser, veni, vidi, vici. Pass these tokens as an array to the next search request.
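As a minimal sketch of that hand-off, the snippet below pulls the token strings out of an analyze response. The helper name `tokens_from_analyze_response` is hypothetical, and `sample_response` is a hand-written, trimmed illustration of the response shape (a `tokens` list whose entries carry a `token` string), not output captured from a real cluster.

```python
def tokens_from_analyze_response(response):
    """Return the token strings from an _analyze response.

    Duplicates are dropped and first-seen order is kept, since only the
    set of tokens matters for the subset check in the next request.
    """
    seen = set()
    tokens = []
    for entry in response["tokens"]:
        token = entry["token"]
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


# Hand-written sample shaped like an _analyze response (assumption, trimmed).
sample_response = {
    "tokens": [
        {"token": "ceaser", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>"},
        {"token": "veni", "start_offset": 7, "end_offset": 11, "type": "<ALPHANUM>"},
        {"token": "vidi", "start_offset": 12, "end_offset": 16, "type": "<ALPHANUM>"},
        {"token": "vici", "start_offset": 17, "end_offset": 21, "type": "<ALPHANUM>"},
    ]
}

search_tokens = tokens_from_analyze_response(sample_response)
print(search_tokens)
```

The resulting list is what goes into the `search_tokens` script parameter of the second request.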

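2) Next, search for documents whose tokens are a subset of the search tokens. The request body uses the "filtered" query, which exists only in older Elasticsearch releases (it was deprecated in 2.0 and removed in 5.0).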

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "match": {
                  "title": "Ceaser veni,vidi,vici"
                }
              }
            },
            {
              "script": {
                "script": "if(search_tokens.containsAll(doc['title'].values)){return true;}",
                "params": {
                  "search_tokens": [
                    "ceaser",
                    "veni",
                    "vidi",
                    "vici"
                  ]
                }
              }
            }
          ]
        }
      }
    }
  }
}
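The request body above can be assembled programmatically once the tokens are known. The following is only a sketch: `build_subset_search_body` is a hypothetical helper, it mirrors the structure and script string of the request shown here, and it does not send anything to a cluster.

```python
import json


def build_subset_search_body(field, search_phrase, search_tokens):
    """Build the two-clause filter: a match query to narrow candidates,
    then a script that keeps documents whose tokens are all in search_tokens."""
    script = "if(search_tokens.containsAll(doc['%s'].values)){return true;}" % field
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            # Cheap pre-filter: documents sharing at least one token.
                            {"query": {"match": {field: search_phrase}}},
                            # Expensive check, run only on the pre-filtered documents.
                            {
                                "script": {
                                    "script": script,
                                    "params": {"search_tokens": list(search_tokens)},
                                }
                            },
                        ]
                    }
                }
            }
        }
    }


body = build_subset_search_body(
    "title", "Ceaser veni,vidi,vici", ["ceaser", "veni", "vidi", "vici"]
)
print(json.dumps(body, indent=2))
```

Keeping the raw phrase in the match clause and the analyzed tokens in the script parameters matches the two-request flow: the phrase is analyzed again by Elasticsearch for the match, while the script compares against the exact token list.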

Here the job of the first match query inside the filter is to narrow down the documents on which the script has to run. The containsAll method checks whether all of the document's tokens are contained in the search tokens (a subset check, so order does not matter). This will be slow, but it will do the job with your current setup. One big improvement is to store the tokens as an array in a separate field, so that doc['title'].values can be replaced with that field, which will improve the script.
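To make the subset semantics concrete, here is a small sketch that reproduces the check outside Elasticsearch. The tokenizer `approx_standard_tokens` is an assumption: it only lowercases and splits on non-word characters, which approximates the standard analyzer for simple inputs like these but is not the real analyzer. The function name `field_matches_phrase` is likewise hypothetical.

```python
import re


def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer (assumption): lowercase,
    then split on anything that is not a word character."""
    return re.findall(r"\w+", text.lower())


def field_matches_phrase(field_value, search_phrase):
    """True when every token of the field appears in the search phrase.

    This is the same question containsAll answers in the script filter:
    field tokens must be a subset of the search tokens, in any order.
    """
    field_tokens = set(approx_standard_tokens(field_value))
    search_tokens = set(approx_standard_tokens(search_phrase))
    return field_tokens <= search_tokens


field = "veni, vidi, vici"
print(field_matches_phrase(field, "Ceaser veni,vidi,vici"))  # extra query token is fine
print(field_matches_phrase(field, "vidi, veni, vici"))       # order does not matter
print(field_matches_phrase(field, "veni, vidi"))             # "vici" is missing from the query
```

With the question's examples, the first two phrases satisfy the check and the third does not, because the field token "vici" is absent from the search phrase.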

Hope this helps!