How to properly handle multi-word synonym expansion using Elasticsearch?
Problem description
I have the following synonym expansion:
suco => suco, refresco, bebida de soja
What I want is to tokenize the search this way:
A search for "suco de laranja" would be tokenized to ["suco", "laranja", "refresco", "bebida de soja"].
But I'm getting it tokenized to ["suco", "laranja", "refresco", "bebida", "soja"].
Note that "de" is a stop word. I want it to be dropped from queries, so that "bebida de laranja" becomes ["bebida", "laranja"]. But I don't want it to be dropped inside a synonym, so "bebida de soja" should stay as the single token "bebida de soja".
My settings:
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
Solution
I would suggest making the following two changes. The first relates directly to the question you asked; the second is a general suggestion.
First, instead of expanding one word into multiple synonyms, do the opposite: make all the synonyms point to a single word. So, change
"suco => suco, refresco, bebida de soja"
to
"suco, refresco, bebida de soja => suco"
Second, change the order of the filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that letter case doesn't affect the synonym_br token filter.
So the final settings will be:
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
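One way to check these settings is to create an index with them and run sample text through the _analyze API. The sketch below only builds the two HTTP requests with Python's standard library and does not send them. The host http://localhost:9200 and the index name bebidas are assumptions made for illustration; they are not part of the original answer.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node
INDEX = "bebidas"                 # hypothetical index name

# The final settings from the answer, as a Python dict.
SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "synonym_br": {
                    "type": "synonym",
                    "synonyms": ["suco, refresco, bebida de soja => suco"],
                },
                "brazilian_stop": {"type": "stop", "stopwords": "_brazilian_"},
            },
            "analyzer": {
                "synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "synonym_br",
                        "brazilian_stop",
                        "asciifolding",
                    ],
                }
            },
        }
    }
}


def _json_request(url, payload, method):
    """Build (but do not send) a JSON request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def create_index_request():
    """PUT /<index> with the analysis settings."""
    return _json_request(f"{ES_URL}/{INDEX}", SETTINGS, "PUT")


def analyze_request(text):
    """POST /<index>/_analyze using the custom 'synonyms' analyzer."""
    payload = {"analyzer": "synonyms", "text": text}
    return _json_request(f"{ES_URL}/{INDEX}/_analyze", payload, "POST")


# To actually run it against a live node (not executed here):
#   urllib.request.urlopen(create_index_request())
#   with urllib.request.urlopen(analyze_request("bebida de soja")) as resp:
#       print([t["token"] for t in json.load(resp)["tokens"]])
```

The _analyze response lists each emitted token with its position, which makes it easy to compare the real output with what you expect for inputs such as "bebida de soja" or "de soja".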
How does this work?
For the input bebida de soja, the filters are applied in the following order:
Filter           Result tokens
====================================
lowercase        bebida, de, soja
synonym_br       suco               <------- all the above tokens (including position) exactly match a synonym
brazilian_stop   suco
asciifolding     suco
Let's see brazilian_stop in action. For this we need an input that doesn't match any synonym but has de in it, e.g. de soja:
Filter           Result tokens
====================================
lowercase        de, soja
synonym_br       de, soja           <------- none of the tokens (independently or combined, including position) match any synonym
brazilian_stop   soja               <------- de is removed as it is a stop word
asciifolding     soja
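The two traces above can be reproduced with a small toy model of the filter chain. This is only a sketch in plain Python, not Elasticsearch code: whitespace splitting stands in for the standard tokenizer, the stop list holds only "de" rather than the full _brazilian_ list, and the synonym step is a simple longest-phrase replacement that ignores token positions. All function names are made up for illustration.

```python
import unicodedata

SYNONYM_RULE = "suco, refresco, bebida de soja => suco"
STOPWORDS = {"de"}  # assumption: tiny subset of the _brazilian_ stop list


def parse_rule(rule):
    """Split 'a, b c => x' into ([['a'], ['b', 'c']], ['x'])."""
    left, right = rule.split("=>")
    sources = [term.split() for term in left.split(",")]
    return sources, right.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace each matching source phrase with the target tokens."""
    sources, target = parse_rule(SYNONYM_RULE)
    sources.sort(key=len, reverse=True)  # try multi-word phrases first
    out, i = [], 0
    while i < len(tokens):
        for phrase in sources:
            if tokens[i:i + len(phrase)] == phrase:
                out.extend(target)
                i += len(phrase)
                break
        else:  # no phrase starts at position i: keep the token
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Strip diacritics by decomposing and dropping combining marks."""
    def fold(tok):
        decomposed = unicodedata.normalize("NFKD", tok)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    return [fold(t) for t in tokens]


# Same order as the "synonyms" analyzer in the final settings.
PIPELINE = [lowercase, synonym_br, brazilian_stop, asciifolding]


def analyze(text):
    """Return [(filter_name, tokens_after_filter), ...] for the input text."""
    tokens = text.split()  # crude stand-in for the standard tokenizer
    trace = []
    for stage in PIPELINE:
        tokens = stage(tokens)
        trace.append((stage.__name__, list(tokens)))
    return trace


if __name__ == "__main__":
    for text in ("bebida de soja", "de soja"):
        print(text)
        for name, tokens in analyze(text):
            print(f"  {name:<15}{', '.join(tokens)}")
```

Because synonym_br runs before brazilian_stop, the phrase bebida de soja is still intact when the synonym rule is checked, which is why it collapses to suco instead of losing its de first.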