Word-oriented completion suggester (ElasticSearch 5.x)


Question

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

In short, every completion query now returns all matching documents instead of just the matched words. And herein lies the problem: autocompleted words are duplicated if they occur in more than one document.

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

With a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

In short, the completion suggestion for the text "joh" returned two (2) documents, both for John, and both with the same value of the text property.

However, I would like to receive one (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}

Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.

Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?

EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

  1. Keeping the new index in sync.
  2. Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, indexing only single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completion of subsequent words. And with this, you actually have the same problem as when querying the original index. Therefore, the additional index doesn't make sense anymore.
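To make the second point concrete, here is a minimal sketch in plain Python, with no Elasticsearch involved. The names `FULL_NAMES`, `WORDS`, `complete_global` and `complete_in_context` are hypothetical and only illustrate the difference between a flat word list and a lookup that still knows which words occur together. It assumes the typed fragment is non-empty and does not end in whitespace.

```python
# Hypothetical in-memory data standing in for the two index designs.
FULL_NAMES = ["John Doe", "David Smith"]
WORDS = [word for name in FULL_NAMES for word in name.split()]


def complete_global(words, fragment):
    """Complete the last typed word against a flat word list (no document context)."""
    last = fragment.split()[-1].lower()
    return sorted(w for w in set(words) if w.lower().startswith(last))


def complete_in_context(full_names, fragment):
    """Complete the last typed word using only names that match everything typed so far."""
    typed = fragment.lower()
    last_index = len(fragment.split()) - 1
    matches = {
        name.split()[last_index]
        for name in full_names
        if name.lower().startswith(typed)
    }
    return sorted(matches)


print(complete_global(WORDS, "John D"))           # flat word list: both "David" and "Doe"
print(complete_in_context(FULL_NAMES, "John D"))  # document-aware: only "Doe"
```

The second function only works because it still has the full names, which is exactly the word-to-document mapping that a words-only index throws away.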

Recommended answer

As hinted at in the comment, another way of achieving this without getting duplicate documents is to add a dedicated autocomplete field with a sub-field containing edge n-grams of its value. First you define your mapping like this:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}
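To see why this mapping supports prefix matching, it can help to look at the terms the analyzer emits. The function below is only a rough Python approximation of the `completion_analyzer` chain (keyword tokenizer, then lowercase, then edge_ngram with min_gram 1 and max_gram 24). It is not Elasticsearch code, and it ignores details such as Unicode handling.

```python
def completion_analyzer(text, min_gram=1, max_gram=24):
    """Approximate keyword tokenizer -> lowercase -> edge_ngram(min_gram, max_gram)."""
    # The keyword tokenizer keeps the whole field value as a single token.
    token = text.lower()
    # edge_ngram emits prefixes of that token, capped at max_gram characters.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


terms = completion_analyzer("John Doe")
print(terms)
```

Because the whole value is a single token, the prefixes run across the space, so "john d" is itself an indexed term for "John Doe" but not for "Johnny Cash".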

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

Then you can query for john d and get one result for John Doe and another one for John Deere:

POST my-index/_search
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
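On the client side, the suggestions are simply the bucket keys. Here is a minimal sketch, assuming the response has already been parsed into a Python dict. The helper name `suggestion_keys` is hypothetical, and the sample dict repeats the response shown above.

```python
def suggestion_keys(response):
    """Return the bucket keys of the 'suggestions' terms aggregation, in response order."""
    buckets = (
        response.get("aggregations", {})
        .get("suggestions", {})
        .get("buckets", [])
    )
    return [bucket["key"] for bucket in buckets]


response = {
    "aggregations": {
        "suggestions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "John Doe", "doc_count": 1},
                {"key": "John Deere", "doc_count": 1},
            ],
        }
    }
}

print(suggestion_keys(response))
```

Each distinct value of autocomplete.raw gets one bucket, so a name shared by several documents appears once with a higher doc_count. Note that a terms aggregation returns at most 10 buckets unless its size parameter is set.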

UPDATE (June 25, 2019):

ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
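As a rough illustration, the request bodies for that data type could be built as below. This is a sketch based on the pattern in the linked documentation (a multi_match query of type bool_prefix over the field and its `._2gram` and `._3gram` sub-fields). The field name `autocomplete` is an assumption carried over from the mapping above, and the helper names are hypothetical.

```python
import json

FIELD = "autocomplete"  # assumed field name, borrowed from the mapping above


def build_mapping(field=FIELD):
    """Typeless (ES 7.x) index mapping with a single search_as_you_type field."""
    return {"mappings": {"properties": {field: {"type": "search_as_you_type"}}}}


def build_query(text, field=FIELD):
    """bool_prefix multi_match over the field and its generated shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": [field, field + "._2gram", field + "._3gram"],
            }
        }
    }


print(json.dumps(build_mapping(), indent=2))
print(json.dumps(build_query("john d"), indent=2))
```

This is an ordinary search that returns matching documents, so getting one entry per distinct name would presumably still require something like the terms aggregation above.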
