ElasticSearch does not respect Max NGram length while using NGram Tokenizer
I am using the NGram tokenizer, and I have specified min_gram as 3 and max_gram as 5. However, even if I search for a word longer than 5 characters, it still gives me the result. This is strange, as ES will not index the 6-character combination, yet I am still able to retrieve the record. Is there some theory I am missing here? If not, what significance does the max_gram of NGram really have? Following is the mapping that I tried:
PUT ngramtest
{
"mappings": {
"MyEntity":{
"properties": {
"testField":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5
}
}
}
}
}
Indexed a test entity as:
PUT ngramtest/MyEntity/123
{
"testField":"Z/16/000681"
}
AND, this query weirdly yields results:
GET ngramtest/MyEntity/_search
{
"query": {
"match": {
"testField": "000681"
}
}
}
I have tried analyzing the string as follows:
POST ngramtest/_analyze
{
"analyzer": "my_analyzer",
"text": "Z/16/000681."
}
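The slicing that the _analyze call performs can also be reproduced outside Elasticsearch. This is a minimal plain-Python sketch of the ngram tokenizer's behavior (assuming the default token_chars, i.e. no character classes are stripped), confirming that no emitted token is longer than max_gram:

```python
def ngrams(text, min_gram=3, max_gram=5):
    """Emit every substring whose length is between min_gram and max_gram."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

tokens = ngrams("Z/16/000681")
print(max(len(t) for t in tokens))  # 5 -> the 6-char "000681" is never indexed
print("000681" in tokens)           # False
```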
Can someone please correct me if I am going wrong?
The reason for this is that your analyzer my_analyzer is used for both indexing AND searching. Hence, when you search for a 6-character word such as abcdef, that word will also be analyzed by your ngram analyzer at search time and produce the tokens abc, abcd, abcde, bcd, etc., and those will match the indexed tokens.
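This token overlap can be sketched in plain Python (a simulation of the ngram analysis, not Elasticsearch itself):

```python
def ngrams(text, min_gram=3, max_gram=5):
    # All substrings of length min_gram..max_gram, as the ngram tokenizer emits.
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

indexed = ngrams("Z/16/000681")   # tokens stored at index time
searched = ngrams("000681")       # the SAME analyzer runs on the query

# Every 3- to 5-gram of "000681" is also a substring of the indexed value,
# so all of them match and the document is returned.
print(sorted(indexed & searched))  # all 9 query grams match
```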
What you need to do is specify the standard analyzer as the search_analyzer in your mapping:
"testField":{
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
Before wiping your index and repopulating it, you can test this theory simply by specifying the search analyzer to use in your match query:
GET ngramtest/MyEntity/_search
{
"query": {
"match": {
"testField": {
"query": "000681",
"analyzer": "standard"
}
}
}
}
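Extending the same simulation shows why the match now fails. Here the standard search analyzer is approximated by lowercasing and splitting on non-alphanumeric characters (a simplification of the real standard analyzer, which does more):

```python
import re

def standard_analyze(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alnum.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def ngrams(text, min_gram=3, max_gram=5):
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

indexed = ngrams("Z/16/000681")            # index-time ngram tokens
query_tokens = standard_analyze("000681")  # search time: the whole word

# The 6-character token exceeds max_gram, so it was never indexed:
print(any(tok in indexed for tok in query_tokens))  # False -> no hit
```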