Tokenizing string for completion suggester


Question

I want to build the autocomplete functionality of an e-commerce website using the Completion Suggester.

This is my index:

PUT dummy
{
    "mappings": {
        "_doc" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion"
                },
                "title" : {
                    "type": "keyword"
                }, 
                "category" : { 
                    "type": "keyword"
                },
                "description" : { 
                    "type": "keyword"
                }
            }
        }
    }
}

Now, when uploading an advertisement, I want the title field to be used for autocomplete, so this is how I upload a document:

POST dummy/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": "Blue Asics running shoes" // <-- use title
  }
}

The problem is that, this way, Elasticsearch only matches the string from the beginning: "Blu" finds the result, but "Asic", "Run" or "Sho" return nothing.

So what I need to do is tokenize my input like this:

POST dummy/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": ["Blue", "Asics", "running", "shoes"] // <-- tokenized title
  }
}

This would work fine, but how am I supposed to tokenize my field? I know I can split the string in C#, but is there any way to do this in Elasticsearch/NEST?

Recommended answer

The Completion Suggester is designed for fast search-as-you-type prefix queries. It uses the simple analyzer, not the standard analyzer that is the default for text datatypes.
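
For reference, a prefix query against the `suggest` field from the question might look like the sketch below. The suggestion name `title-suggest` is a placeholder chosen for illustration; the index and field names come from the question.

```
# "title-suggest" is an arbitrary label for this suggestion (assumption)
POST dummy/_search
{
  "suggest": {
    "title-suggest": {
      "prefix": "Blu",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
```

Because the suggester walks prefixes of each stored input, a single input of "Blue Asics running shoes" can only be reached by typing from "B" onwards.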

If you need partial prefix matching on any token in the title, and not just from the beginning of the title, you may want to consider one of these approaches:

  1. Use the Analyze API with an analyzer that tokenizes the title into the tokens/terms you want to prefix-match, and index this collection as the input to the completion field. The standard analyzer may be a good one to start with.

Bear in mind that the data structure for the completion suggester is held in memory while in use, so high term cardinality across documents will increase its memory demands. Also consider that "scoring" of matching terms is simple, in that it is controlled by the weight applied to each input.
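
As a rough sketch of this first approach, the request below asks the Analyze API to tokenize the sample title from the question with the standard analyzer. Nothing beyond the question's own index and field names is assumed.

```
# Tokenize the title with the built-in standard analyzer
POST _analyze
{
  "analyzer": "standard",
  "text": "Blue asics running shoes"
}
```

The response contains a `tokens` array; for this title the `token` values should be `blue`, `asics`, `running` and `shoes`. The client application (for example through NEST) would collect those values and send them as the completion input:

```
# Index the collected tokens as the completion input
POST dummy/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": ["blue", "asics", "running", "shoes"]
  }
}
```

An optional `weight` can be set alongside `input` if some documents should rank above others.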

  2. Don't use the Completion Suggester here. Instead, set up the title field as a text datatype with multi-fields that cover the different ways the title should be analyzed (or not analyzed, for example with a keyword sub-field).

Spend some time with the Analyze API to build an analyzer that allows partial prefix matching of terms anywhere in the title. As a start, something like the Standard tokenizer, Lowercase token filter, Edge n-gram token filter and possibly the Stop token filter would get you running. Also note that you would want a search analyzer that does the same as the index analyzer except for the Edge n-gram token filter, as tokens in the search input do not need to be n-grammed.
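
A minimal sketch of this second approach follows. The index name `ads`, the analyzer names `title_index` and `title_search`, the filter name `title_edge_ngram`, and the gram sizes 2 and 15 are all illustrative assumptions, not values from the question. The mapping keeps the `_doc` type level to match the mapping style used in the question; newer Elasticsearch versions omit that level.

```
# All names and gram sizes below are hypothetical; tune them for your data
PUT ads
{
  "settings": {
    "analysis": {
      "filter": {
        "title_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "title_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "title_edge_ngram"]
        },
        "title_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "title_index",
          "search_analyzer": "title_search",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

With a mapping along these lines, a plain `match` query on `title` can serve partial input such as "asic":

```
# The search analyzer only lowercases, so "asic" is looked up as a single term
GET ads/_search
{
  "query": {
    "match": {
      "title": {
        "query": "asic",
        "operator": "and"
      }
    }
  }
}
```

Running `title_index` through the Analyze API on a few real titles is a quick way to check that the indexed terms include the prefixes you expect.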
