ElasticSearch does not respect Max NGram length while using NGram Tokenizer


Question


I am using the NGram tokenizer and have specified min_gram as 3 and max_gram as 5. However, even if I search for a word longer than 5 characters, it still returns results. This is strange, because ES will not index the length-6 combination, yet I am still able to retrieve the record. Is there any theory I am missing here? If not, what significance does the max_gram of NGram really have? Following is the mapping that I tried:

PUT ngramtest
{
  "mappings": {
    "MyEntity":{
      "properties": {
        "testField":{
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }

    }
  }, 
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}

Indexed a test entity as:

PUT ngramtest/MyEntity/123
{
  "testField":"Z/16/000681"

}

And this query weirdly yields results:

GET  ngramtest/MyEntity/_search
{
 "query": {
   "match": {
     "testField": "000681"
   }
 }
}

I have tried the following to analyze the string:

POST ngramtest/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Z/16/000681."
}

Can someone please point out where I am going wrong?

Solution

The reason is that your analyzer my_analyzer is used for indexing AND searching. Hence, when you search for a 6-character word like abcdef, that word is also analyzed by your ngram analyzer at search time and produces the tokens abc, abcd, abcde, bcd, etc., and those will match the indexed tokens.
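
You can check this by running the search term itself through the analyzer. With min_gram 3 and max_gram 5, the call below should return only 3- to 5-character substrings of 000681 (000, 0006, 00068, 006, and so on), never the full 6-character token, and several of those substrings also exist in the index for Z/16/000681:

POST ngramtest/_analyze
{
  "analyzer": "my_analyzer",
  "text": "000681"
}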

What you need to do is specify that you want to use the standard analyzer as the search_analyzer in your mapping:

    "testField":{
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "standard"
    }
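
Put together with the mapping from the question, a recreated index would look roughly like this (a sketch, assuming the same index, type, and analyzer names are kept):

PUT ngramtest
{
  "mappings": {
    "MyEntity": {
      "properties": {
        "testField": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}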

Before wiping your index and repopulating it, you can test this theory simply by specifying the search analyzer to use in your match query:

GET ngramtest/MyEntity/_search
{
  "query": {
    "match": {
      "testField": {
        "query": "000681", 
        "analyzer": "standard"
      }
    }
  }
}
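
If this query (with the standard analyzer) no longer returns the document for 000681 while your original query does, that confirms the search-time ngram analysis is what made the match succeed.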
