ElasticSearch: Can we apply both n-gram and language analyzers during indexing


Question


Thanks a lot @Random, I have modified the mapping as follows. For testing I used "movie" as my index type. Note: I have also added a search_analyzer; I was not getting proper results without it. However, I have the following doubts about using a search_analyzer.

1] Can we use a custom search_analyzer in the case of language analyzers?
2] Am I getting all the results because of the n-gram analyzer I used, and not because of the english analyzer? (See the _analyze sketch after the mapping below.)

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "whitespace"
                },
                "search_analyzer":{
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": "lowercase"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    },
      "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "en": {
              "type":     "string",
              "analyzer": "english_ngram",
              "search_analyzer": "search_analyzer"
            }
          }
        }
      }
    }
  }
}
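
To check which analyzer is actually producing the matches (doubt 2]), the _analyze API can compare the tokens each analyzer emits. A minimal sketch, assuming Elasticsearch 5.x or later (on 2.x, pass analyzer and text as query-string parameters instead of a body):

GET http://localhost:9200/movies/_analyze

{
    "analyzer": "english_ngram",
    "text": "special movie"
}

Repeating the call with "analyzer": "search_analyzer" shows what the query side sees. If hits only appear for terms that occur among the emitted n-grams, the matches come from ngram_filter rather than from the english stemming chain.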

Update:

Using a search analyzer is also not working consistently, and I need more help with this. I am updating the question with my findings.

I used the following mapping as suggested (note: this mapping does not use a search analyzer); for simplicity, let's consider only the english analyzer.

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "standard"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}

Created the index:

PUT http://localhost:9200/movies/movie/1

{"title":"$peci@l movie"}

Tried the following query:

GET http://localhost:9200/movies/movie/_search

{
    "query": {
        "multi_match": {
            "query": "$peci mov",
            "fields": ["title"],
            "operator": "and"
        }
    }
}

I got no results for this. Am I doing anything wrong? I am trying to get results for:

1] Special characters
2] Partial matches
3] Space separated partial and full words
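
To debug this, the term vectors API can list the tokens that were actually indexed for the document (a sketch, assuming the index and document created above):

GET http://localhost:9200/movies/movie/1/_termvectors?fields=title

Two things are worth checking in the output. First, the standard tokenizer splits "$peci@l" on the non-letter characters, so the $ and @ never reach the index. Second, the settings block above defines english_ngram but contains no "mappings" section applying it to title, so the field is presumably analyzed with the default standard analyzer and no n-grams are indexed at all, which would explain why "mov" finds nothing.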

Thanks again!

Solution

You can create a custom analyzer based on the language analyzers. The only difference is that you add your ngram_filter token filter to the end of the chain. In this case you first get language-stemmed tokens (the default chain), which are converted to edge n-grams at the end (your filter). You can find the implementation of the language analyzers here https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer in order to override them. Here is an example of this change for the english language:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "standard"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}
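
On your first question: yes, a custom search_analyzer can be combined with this. A common pattern with edge n-grams (a sketch; english_search is a hypothetical name, and the analyzer reuses the filters and english_ngram definitions from the settings above) is to apply the same chain at search time but without ngram_filter, so a whole query term can match the prefix grams produced at index time:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_search": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ],
                    "tokenizer": "standard"
                }
            }
        }
    },
    "mappings": {
        "movie": {
            "properties": {
                "title": {
                    "type": "string",
                    "analyzer": "english_ngram",
                    "search_analyzer": "english_search"
                }
            }
        }
    }
}

Without an explicit search_analyzer, the index-time analyzer is also applied to the query, so the query itself gets n-grammed, which makes matches much looser.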

UPDATE

To support special characters, you can try using the whitespace tokenizer instead of standard. In this case these characters will become part of your tokens:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "whitespace"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}
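
You can verify the effect with the _analyze API (a sketch; on 2.x, pass analyzer and text as query-string parameters instead of a body):

GET http://localhost:9200/movies/_analyze

{
    "analyzer": "english_ngram",
    "text": "$peci@l movie"
}

With the whitespace tokenizer the first token stays "$peci@l", so the emitted edge n-grams should include "$", "$p", "$pe" and so on up to "$peci@l", and the query term "$peci" now has grams to match.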
