Word-oriented completion suggester (ElasticSearch 5.x)


Problem description


ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

With a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

In short, a completion suggest request for the text "joh" returned two (2) documents, one for each John, and both options had the same value in the text property.

However, I would like to receive one (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}
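
For comparison, that shape can be approximated on the client by collapsing the returned options on their text. The sketch below is a workaround under stated assumptions, not a feature of the Completion Suggester: unique_suggestions is a hypothetical helper, and the response dict is a trimmed copy of the output shown earlier.

```python
def unique_suggestions(response, name):
    """Return the distinct option texts of one suggester, in first-seen order."""
    seen = set()
    words = []
    for entry in response.get(name, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                words.append(text)
    return words


# Trimmed copy of the suggest response above (the _source parts are omitted).
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

print(unique_suggestions(response, "my-suggest"))  # ['John']
```

This only hides the duplicates after the fact. The number of options the suggester returns is limited before any client-side de-duplication happens, so repeated words can still crowd out other candidates.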

Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.

Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?


EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

  1. Keeping the new index in sync.
  2. Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
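
The difference between the two behaviours can be illustrated with a small pure-Python simulation. Everything here is hypothetical: the two sample documents are made up to match the words listed above, and complete_global and complete_narrowed are illustrative helpers, not Elasticsearch calls.

```python
# Hypothetical sample data covering the words "John", "Doe", "David", "Smith".
documents = ["John Doe", "David Smith"]


def complete_global(query, docs):
    """Complete only the last word against every word in the index."""
    words = sorted({word for doc in docs for word in doc.split()})
    prefix = query.split()[-1].lower()
    return [word for word in words if word.lower().startswith(prefix)]


def complete_narrowed(query, docs):
    """Complete the last word using only documents that contain the earlier words."""
    *context, prefix = query.lower().split()  # assumes a non-empty query
    matches = []
    for doc in docs:
        doc_words = doc.split()
        lowered = [word.lower() for word in doc_words]
        if not all(word in lowered for word in context):
            continue  # this document does not contain the words typed so far
        for word in doc_words:
            is_new = word.lower() not in context and word not in matches
            if word.lower().startswith(prefix) and is_new:
                matches.append(word)
    return matches


print(complete_global("John D", documents))    # ['David', 'Doe']
print(complete_narrowed("John D", documents))  # ['Doe']
```

The narrowed variant needs to know which words occur together in a document, which is exactly the word-to-document mapping that a words-only index lacks.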

Solution

As hinted at in the comment, another way of achieving this without getting duplicate documents is to create a dedicated autocomplete field with a sub-field that contains edge n-grams of the field's value. First you define your mapping like this:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}
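
To make the effect of completion_analyzer easier to picture, here is a rough pure-Python approximation of the tokens it should produce at index time: the keyword tokenizer keeps the whole value as one token, the lowercase filter lowercases it, and the edge_ngram filter emits every prefix from min_gram to max_gram characters. This is a simulation under those assumptions, not a call to Elasticsearch, and completion_tokens is a hypothetical name.

```python
def completion_tokens(value, min_gram=1, max_gram=24):
    """Approximate keyword tokenizer + lowercase + edge_ngram(min_gram, max_gram)."""
    token = value.lower()                  # one token for the whole value, lowercased
    longest = min(len(token), max_gram)    # prefixes never exceed max_gram characters
    return [token[:n] for n in range(min_gram, longest + 1)]


tokens = completion_tokens("John Doe")
print(tokens)
# ['j', 'jo', 'joh', 'john', 'john ', 'john d', 'john do', 'john doe']

print("john d" in tokens)                            # True
print("john d" in completion_tokens("Johnny Cash"))  # False
```

Because every prefix of the full phrase is indexed as its own term, an exact term lookup for a lowercased prefix such as john d can match. A term query is not analyzed, so the client has to lowercase the user's input itself.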

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

Then you can query for john d and get one result for John Doe and another one for John Deere:

POST my-index/_search
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
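
On the client, the suggestions are then just the bucket keys. A minimal sketch, assuming the response has already been parsed into a Python dict; suggestions_from_buckets is a hypothetical helper and the response literal is copied from the result above.

```python
def suggestions_from_buckets(response, agg_name="suggestions"):
    """Return the bucket keys of a terms aggregation, in the order returned."""
    aggregation = response.get("aggregations", {}).get(agg_name, {})
    return [bucket["key"] for bucket in aggregation.get("buckets", [])]


# Copy of the aggregation result shown above.
response = {
    "aggregations": {
        "suggestions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "John Doe", "doc_count": 1},
                {"key": "John Deere", "doc_count": 1},
            ],
        }
    }
}

print(suggestions_from_buckets(response))  # ['John Doe', 'John Deere']
```

Since a terms aggregation groups documents by value, two documents with the same autocomplete value would end up in one bucket with doc_count 2 rather than as two separate suggestions.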
