How to properly handle multi-word synonym expansion using Elasticsearch?


Problem description



I have the following synonym expansion:

suco => suco, refresco, bebida de soja

What I want is for the search to be tokenized this way:

A search for "suco de laranja" would be tokenized to ["suco", "laranja", "refresco", "bebida de soja"].

But I'm getting it tokenized to ["suco", "laranja", "refresco", "bebida", "soja"].

Consider that "de" is a stop word. I want it to be ignored in queries, so that "bebida de laranja" becomes ["bebida", "laranja"], but I don't want it to be considered during synonym tokenization, so that "bebida de soja" still stays as the single token "bebida de soja".

My settings:

{
    "settings":{
        "analysis":{
            "filter":{
                "synonym_br":{
                    "type":"synonym",
                    "synonyms":[
                        "suco => suco, refresco, bebida de soja"
                    ]
                },
                "brazilian_stop":{
                    "type":"stop",
                    "stopwords":"_brazilian_"
                }
            },
            "analyzer":{
                "synonyms":{
                    "filter":[
                        "synonym_br",
                        "lowercase",
                        "brazilian_stop",
                        "asciifolding"
                    ],
                    "type":"custom",
                    "tokenizer":"standard"
                }
            }
        }
    }
}
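For reference, the resulting tokens can be inspected with the _analyze API. A minimal sketch, assuming these settings are applied to an index (the name my_index below is just a placeholder):

GET /my_index/_analyze
{
  "analyzer": "synonyms",
  "text": "suco de laranja"
}

With the settings above, this should return the tokens the question describes, suco, laranja, refresco, bebida and soja (order aside), i.e. the multi-word synonym bebida de soja ends up as the separate tokens bebida and soja once brazilian_stop removes de.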

Solution

I would suggest making the following two changes. The first directly relates to the question you asked; the second is a general recommendation.

  1. Instead of expanding one term into multiple synonyms, do the opposite, i.e. map all the synonyms to a single word (a contraction). So change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco".

  2. Change the order of filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that letter case doesn't affect the synonym_br token filter.

So the final settings will be:

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
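As a quick check, the original search text can be run through the new analyzer with the _analyze API (a sketch only; my_index is a placeholder for an index created with the settings above):

GET /my_index/_analyze
{
  "analyzer": "synonyms",
  "text": "suco de laranja"
}

This should now yield only the tokens suco and laranja: suco is contracted to itself, de is dropped by brazilian_stop, and refresco or bebida de soja appearing in other inputs would likewise be normalized to suco.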

How does this work?

For the input bebida de soja, the filters apply in the following order:

Filter              Result tokens
=================================
lowercase           bebida, de, soja
synonym_br          suco             <------- all the above tokens (including positions) exactly match a synonym
brazilian_stop      suco
asciifolding        suco
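This can be verified against the same placeholder index:

GET /my_index/_analyze
{
  "analyzer": "synonyms",
  "text": "bebida de soja"
}

which should return the single token suco.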

Let's see brazilian_stop in action. For this we need an input that doesn't match the synonym but has de in it, e.g. de soja:

Filter              Result tokens
=================================
lowercase           de, soja
synonym_br          de, soja  <------- none of the tokens (independently or combined, including positions) matches any synonym
brazilian_stop      soja      <------- de is removed because it is a stopword
asciifolding        soja
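And the corresponding check (again a sketch against the placeholder index):

GET /my_index/_analyze
{
  "analyzer": "synonyms",
  "text": "de soja"
}

which should return just soja.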
