How do I configure Elasticsearch to find substrings at the beginning OR at the end of a word (but not in the middle)?


Problem description


I'm starting to learn Elasticsearch and now I am trying to write my first analyzer configuration. What I want to achieve is that substrings are found if they are at the beginning or at the end of a word. If I have the word "stackoverflow" and I search for "stack" I want to find it, and when I search for "flow" I want to find it, but I do not want to find it when searching for "ackov" (in my use case this would not make sense).

I know there is the "edge n-gram tokenizer", but an analyzer can only have one tokenizer, and the edge n-gram can be either front or back (but not both at the same time).

And if I understood correctly, applying both versions of the "edge n-gram filter" (front and back) to the analyzer, I would not find either, because both filters need to return true, wouldn't they? Because "stack" isn't at the end of the word, the back edge n-gram filter would return false and the word "stackoverflow" would not be found.

So, how do I configure my analyzer to find substrings either in the end or in the beginning of a word, but not in the middle?

Solution

What can be done is to define two analyzers, one for matching at the start of a string and another for matching at the end of a string. In the index settings below, I named the former prefix_edge_ngram_analyzer and the latter suffix_edge_ngram_analyzer. These two analyzers can then be applied to a multi-field: prefix_edge_ngram_analyzer to the text.prefix sub-field and suffix_edge_ngram_analyzer to the text.suffix sub-field.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix_edge_ngram_analyzer": {
          "tokenizer": "prefix_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "suffix_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
        }
      },
      "tokenizer": {
        "prefix_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      },
      "filter": {
        "suffix_edge_ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "prefix": {
              "type": "string",
              "analyzer": "prefix_edge_ngram_analyzer"
            },
            "suffix": {
              "type": "string",
              "analyzer": "suffix_edge_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
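To see why this works, here is a plain-Python sketch (no cluster required) approximating the two token streams produced under the settings above. It illustrates the edge n-gram and reverse-filter trick, not the actual Lucene implementation; the function names are mine:

```python
# Sketch of the two analyzers applied to "stackoverflow":
# - the prefix analyzer emits front edge n-grams of the lowercased word
# - the suffix analyzer lowercases, reverses, takes front edge n-grams,
#   and reverses each gram back, yielding suffixes
# min_gram/max_gram default to 2 and 25, matching the settings above.

def prefix_tokens(text, min_gram=2, max_gram=25):
    text = text.lower()
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

def suffix_tokens(text, min_gram=2, max_gram=25):
    # lowercase -> reverse -> edge n-gram -> reverse, as in suffix_edge_ngram_analyzer
    return [g[::-1] for g in prefix_tokens(text[::-1], min_gram, max_gram)]

tokens = set(prefix_tokens("stackoverflow")) | set(suffix_tokens("stackoverflow"))
print("stack" in tokens)  # True: emitted as a front edge n-gram
print("flow" in tokens)   # True: emitted as a suffix token
print("ackov" in tokens)  # False: middle substrings are never emitted
```

Because every emitted token is a whole prefix or a whole suffix, a query term like "ackov" never equals any token, which is exactly why it yields no hits.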

Then let's say we index the following test document:

PUT test_index/test_type/1
{ "text": "stackoverflow" }

We can then search either by prefix or suffix using the following queries:

# input is "stack" => 1 result
GET test_index/test_type/_search?q=text.prefix:stack OR text.suffix:stack

# input is "flow" => 1 result
GET test_index/test_type/_search?q=text.prefix:flow OR text.suffix:flow

# input is "ackov" => 0 results
GET test_index/test_type/_search?q=text.prefix:ackov OR text.suffix:ackov

Another way is to query with the query DSL:

POST test_index/test_type/_search
{
   "query": {
      "multi_match": {
         "query": "stack",
         "fields": [ "text.*" ]
      }
   }
}
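The OR semantics can also be spelled out with an explicit bool/should query (a sketch equivalent to the multi_match above, matching if either clause matches):

```
POST test_index/test_type/_search
{
   "query": {
      "bool": {
         "should": [
            { "match": { "text.prefix": "stack" } },
            { "match": { "text.suffix": "stack" } }
         ]
      }
   }
}
```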

UPDATE

If you already have a string field, you can "upgrade" it to a multi-field and create the two required sub-fields with their analyzers. To do this, proceed in the following order:

  1. Close your index in order to create the analyzers

    POST test_index/_close
    

  2. Update the index settings

    PUT test_index/_settings
    {
      "analysis": {
        "analyzer": {
          "prefix_edge_ngram_analyzer": {
            "tokenizer": "prefix_edge_ngram_tokenizer",
            "filter": ["lowercase"]
          },
          "suffix_edge_ngram_analyzer": {
            "tokenizer": "keyword",
            "filter": ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
          }
        },
        "tokenizer": {
          "prefix_edge_ngram_tokenizer": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 25
          }
        },
        "filter": {
          "suffix_edge_ngram_filter": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 25
          }
        }
      }
    }
    

  3. Re-open your index

    POST test_index/_open
    

  4. Finally, update the mapping of your text field

    PUT test_index/_mapping/test_type
    {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "prefix": {
              "type": "string",
              "analyzer": "prefix_edge_ngram_analyzer"
            },
            "suffix": {
              "type": "string",
              "analyzer": "suffix_edge_ngram_analyzer"
            }
          }
        }
      }
    }
    

  5. You still need to re-index all your documents in order for the new sub-fields text.prefix and text.suffix to be populated and analyzed.
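For step 5, one possible approach is to scan the existing documents and re-send each _source in bulk. The sketch below assumes the official elasticsearch-py client (helpers.scan / helpers.bulk); the reindex_actions helper is my own name, and only the action-building part runs without a cluster:

```python
# Build bulk "index" actions that re-send each scanned document's _source,
# so that text.prefix and text.suffix get populated and analyzed.
def reindex_actions(docs, index="test_index"):
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": index,
            "_type": doc["_type"],
            "_id": doc["_id"],
            "_source": doc["_source"],
        }

# Usage against a live cluster (assumed setup):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch()
# helpers.bulk(es, reindex_actions(helpers.scan(es, index="test_index")))
```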
