ElasticSearch: Can we apply both n-gram and language analyzers during indexing?
Thanks a lot @Random. I have modified the mapping as follows. For testing I used "movie" as my document type. Note: I have also added a search_analyzer; I was not getting proper results without it. However, I have the following doubts about using a search_analyzer:
1] Can we use a custom search_analyzer together with language analyzers?
2] Am I getting all the results because of the n-gram analyzer I used, and not because of the English analyzer?
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "whitespace"
},
"search_analyzer":{
"type": "custom",
"tokenizer": "whitespace",
"filter": "lowercase"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"en": {
"type": "string",
"analyzer": "english_ngram",
"search_analyzer": "search_analyzer"
}
}
}
}
}
}
}
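A rough way to reason about doubt 2] is to model the mapping above in a few lines of Python. This is a toy sketch, not Elasticsearch itself: it deliberately ignores the stemming and stop-word filters and only models the index side (whitespace tokenize, lowercase, all 1–25 character n-grams) against the search side (whitespace tokenize, lowercase only). Under that simplified model, every query term matches some indexed gram, which suggests the broad recall comes from the n-gram filter rather than from the English chain:

```python
def ngrams(token, lo=1, hi=25):
    """All substrings of length lo..hi, as the `ngram` token filter emits."""
    return {token[i:i + n]
            for n in range(lo, min(hi, len(token)) + 1)
            for i in range(len(token) - n + 1)}

def index_terms(text):
    # Index side: whitespace tokenize -> lowercase -> ngram(1..25)
    return {g for tok in text.lower().split() for g in ngrams(tok)}

def search_terms(text):
    # Search side: whitespace tokenize -> lowercase only
    return set(text.lower().split())

doc = index_terms("$peci@l movie")
q = search_terms("$peci mov")
print(q <= doc)  # True: every query term is some indexed gram
```

The real analyzers also stem and drop stop words, so actual terms will differ, but the matching mechanics are the same: with `min_gram: 1`, almost any short query token will find a gram to match.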
Update :
Using a search analyzer is also not working consistently, and I need more help with this. I am updating the question with my findings.
I used the following mapping as suggested (note: this mapping does not use a search analyzer); for simplicity, let's consider only the English analyzer.
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "standard"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
}
}
Created the index and indexed a document:
PUT http://localhost:9200/movies/movie/1
{"title":"$peci@l movie"}
Tried the following query:
GET http://localhost:9200/movies/movie/_search
{
"query": {
"multi_match": {
"query": "$peci mov",
"fields": ["title"],
"operator": "and"
}
}
}
I got no results for this; am I doing anything wrong? I am trying to get results for:
1] Special characters
2] Partial matches
3] Space separated partial and full words
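Assuming the title field is mapped to the english_ngram analyzer above, a short sketch shows why this query finds nothing. The helper standard_like below is a crude, assumed stand-in for the standard tokenizer (the real one is Unicode-aware): it splits on, and drops, anything non-alphanumeric, so no gram containing "$" is ever indexed:

```python
import re

def edge_ngrams(token, lo=1, hi=25):
    # Leading substrings of length lo..hi, like the edge_ngram token filter.
    return {token[:n] for n in range(lo, min(hi, len(token)) + 1)}

def standard_like(text):
    # Crude stand-in for the `standard` tokenizer: split on (and drop)
    # anything that is not alphanumeric. The real tokenizer is Unicode-aware.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

indexed = {g for tok in standard_like("$peci@l movie") for g in edge_ngrams(tok)}
print("$peci" in indexed)  # False: '$' was stripped before grams were built
print("mov" in indexed)    # True: leading gram of 'movie'
```

With "operator": "and", both query terms must match, so the missing "$peci" gram alone is enough to return no hits.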
Thanks again !
You can create a custom analyzer based on the language analyzers. The only difference is that you add your ngram_filter
token filter to the end of the chain. In this case you first get language-stemmed tokens (the default chain), which are then converted to edge n-grams (your filter). You can find the implementations of the language analyzers here https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer in order to override them. Here is an example of this change for the English language:
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "standard"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
}
}
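The key point is the filter order: ngram_filter runs last, so the grams are built from already-stemmed tokens. A toy sketch of that ordering (possessive_stem is a hypothetical stand-in for english_possessive_stemmer, not the real filter, and the other filters are omitted):

```python
def possessive_stem(token):
    # Toy stand-in for english_possessive_stemmer: strip a trailing "'s".
    return token[:-2] if token.endswith("'s") else token

def edge_ngrams(token, lo=1, hi=25):
    # Leading substrings, like the edge_ngram token filter.
    return [token[:n] for n in range(lo, min(hi, len(token)) + 1)]

def analyze(text):
    # Mirror the chain: tokenize -> lowercase -> stem -> n-gram LAST.
    grams = []
    for tok in text.lower().split():
        grams.extend(edge_ngrams(possessive_stem(tok)))
    return grams

print(analyze("Director's cut"))
# Grams of 'director' (not of "director's") plus grams of 'cut'
```

If the n-gram filter were placed before the stemmers instead, the grams would be built from the raw surface forms and the language processing would largely be lost.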
UPDATE
To support special characters, you can try using the whitespace
tokenizer instead of standard
. In that case these characters become part of your tokens:
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
}
}
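A side-by-side sketch of the two tokenizers on the sample document makes the difference concrete (standard_like is again an assumed approximation of the standard tokenizer, not the real implementation). Because whitespace keeps "$peci@l" as a single token, its edge grams include "$peci", and the earlier query can match:

```python
import re

def whitespace_tok(text):
    # The whitespace tokenizer splits on whitespace only.
    return text.lower().split()

def standard_like(text):
    # Crude approximation of the standard tokenizer: drops punctuation.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

print(whitespace_tok("$peci@l movie"))  # ['$peci@l', 'movie']
print(standard_like("$peci@l movie"))   # ['peci', 'l', 'movie']
```

The trade-off is that with the whitespace tokenizer, punctuation attached to ordinary words ("movie," with a trailing comma, for example) also stays inside the token, so queries must contain the same characters to match.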