How do I configure Elasticsearch to find substrings at the beginning OR at the end of a word (but not in the middle)?
Problem Description
I'm starting to learn Elasticsearch and now I am trying to write my first analyzer configuration. What I want to achieve is that substrings are found if they are at the beginning or end of a word. If I have the word "stackoverflow" and I search for "stack", I want to find it, and when I search for "flow", I want to find it, but I do not want to find it when searching for "ackov" (in my use case this would not make sense).
I know there is the edge n-gram tokenizer, but an analyzer can only have one tokenizer, and the edge n-gram can be anchored either at the front or at the back (but not both at the same time).
And if I understood correctly, applying both versions of the edge n-gram filter (front and back) to the analyzer would not find either, because both filters need to match, wouldn't it? Since "stack" is not at the end of the word, the back edge n-gram filter would not produce it, and the word "stackoverflow" would not be found.
So, how do I configure my analyzer to find substrings either at the beginning or at the end of a word, but not in the middle?
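To make the problem concrete, here is a small Python sketch (an illustration only, not Elasticsearch code) of what front and back edge n-grams of "stackoverflow" look like, and why neither set alone covers both cases:

```python
# Illustrative sketch: emulates edge n-grams the way an edge n-gram
# tokenizer would emit them, anchored at the front or at the back.
def front_edge_ngrams(token, min_gram=2, max_gram=25):
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}

def back_edge_ngrams(token, min_gram=2, max_gram=25):
    return {token[-n:] for n in range(min_gram, min(len(token), max_gram) + 1)}

front = front_edge_ngrams("stackoverflow")
back = back_edge_ngrams("stackoverflow")

print("stack" in front, "stack" in back)    # True False
print("flow" in front, "flow" in back)      # False True
print("ackov" in front or "ackov" in back)  # False
```

Only the front set contains "stack" and only the back set contains "flow", so requiring both anchors at once would match neither query.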
What can be done is to define two analyzers: one for matching at the start of a string and another for matching at the end. In the index settings below, I named the former prefix_edge_ngram_analyzer and the latter suffix_edge_ngram_analyzer. These two analyzers can be applied to a multi-field string field: the former to the text.prefix sub-field and the latter to the text.suffix sub-field.
{
"settings": {
"analysis": {
"analyzer": {
"prefix_edge_ngram_analyzer": {
"tokenizer": "prefix_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"suffix_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
}
},
"tokenizer": {
"prefix_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"suffix_edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"test_type": {
"properties": {
"text": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "prefix_edge_ngram_analyzer"
},
"suffix": {
"type": "string",
"analyzer": "suffix_edge_ngram_analyzer"
}
}
}
}
}
}
}
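To see why the suffix analyzer works, here is a Python sketch simulating the two token chains defined above: front edge n-grams plus lowercase for the prefix field, and keyword + lowercase + reverse + edge n-gram + reverse for the suffix field. This is an illustration of the mechanism, not Elasticsearch code:

```python
def edge_ngrams(token, min_gram=2, max_gram=25):
    # Front-anchored edge n-grams, as the edgeNGram tokenizer/filter emits them.
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def prefix_analyze(text):
    # prefix_edge_ngram_analyzer: edge n-gram tokenizer, then lowercase.
    return edge_ngrams(text.lower())

def suffix_analyze(text):
    # suffix_edge_ngram_analyzer: keyword tokenizer, then
    # lowercase -> reverse -> edge n-gram -> reverse.
    reversed_token = text.lower()[::-1]
    return [gram[::-1] for gram in edge_ngrams(reversed_token)]

print(prefix_analyze("stackoverflow")[:4])  # ['st', 'sta', 'stac', 'stack']
print(suffix_analyze("stackoverflow")[:4])  # ['ow', 'low', 'flow', 'rflow']
```

Reversing the token before and after the edge n-gram filter turns front-anchored grams into suffixes of the original word, which is the whole trick.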
Then let's say we index the following test document:
PUT test_index/test_type/1
{ "text": "stackoverflow" }
We can then search either by prefix or suffix using the following queries:
# input is "stack" => 1 result
GET test_index/test_type/_search?q=text.prefix:stack OR text.suffix:stack
# input is "flow" => 1 result
GET test_index/test_type/_search?q=text.prefix:flow OR text.suffix:flow
# input is "ackov" => 0 results
GET test_index/test_type/_search?q=text.prefix:ackov OR text.suffix:ackov
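The three results can be reproduced with a toy in-memory model: collect the tokens both sub-fields would index, then treat the query as an exact term lookup. This is a sketch of the mechanism only; real query parsing and scoring are more involved:

```python
def edge_ngrams(token, min_gram=2, max_gram=25):
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}

def index_doc(text):
    token = text.lower()
    prefix_tokens = edge_ngrams(token)                           # text.prefix
    suffix_tokens = {g[::-1] for g in edge_ngrams(token[::-1])}  # text.suffix
    return prefix_tokens | suffix_tokens

def hits(indexed_tokens, query):
    # OR query over text.prefix and text.suffix: the doc matches if either
    # field produced the query string as a token at index time.
    return 1 if query.lower() in indexed_tokens else 0

tokens = index_doc("stackoverflow")
print(hits(tokens, "stack"), hits(tokens, "flow"), hits(tokens, "ackov"))
# 1 1 0
```

"stack" comes from the prefix tokens, "flow" from the suffix tokens, and "ackov" from neither, matching the three query results above.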
Another way to query with the query DSL:
POST test_index/test_type/_search
{
"query": {
"multi_match": {
"query": "stack",
"fields": [ "text.*" ]
}
}
}
UPDATE
If you already have a string field, you can "upgrade" it to a multi-field and create the two required sub-fields with their analyzers. The way to do this is to perform the following steps in order:
Close your index in order to create the analyzers
POST test_index/_close
Update the index settings
PUT test_index/_settings
{
  "analysis": {
    "analyzer": {
      "prefix_edge_ngram_analyzer": {
        "tokenizer": "prefix_edge_ngram_tokenizer",
        "filter": ["lowercase"]
      },
      "suffix_edge_ngram_analyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase", "reverse", "suffix_edge_ngram_filter", "reverse"]
      }
    },
    "tokenizer": {
      "prefix_edge_ngram_tokenizer": {
        "type": "edgeNGram",
        "min_gram": "2",
        "max_gram": "25"
      }
    },
    "filter": {
      "suffix_edge_ngram_filter": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 25
      }
    }
  }
}
Re-open your index
POST test_index/_open
Finally, update the mapping of your text field
PUT test_index/_mapping/test_type
{
  "properties": {
    "text": {
      "type": "string",
      "fields": {
        "prefix": {
          "type": "string",
          "analyzer": "prefix_edge_ngram_analyzer"
        },
        "suffix": {
          "type": "string",
          "analyzer": "suffix_edge_ngram_analyzer"
        }
      }
    }
  }
}
You still need to re-index all your documents so that the new sub-fields text.prefix and text.suffix get populated and analyzed.