Elasticsearch highlighting on ngram filter is weird if min_gram is set to 1


Question

So I have this index:

{
  "settings":{
    "index":{
      "number_of_replicas":0,
      "analysis":{
        "analyzer":{
          "default":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter":{
          "my_ngram":{
            "type":"nGram",
            "min_gram":2,
            "max_gram":20
          }
        }
      }
    }
  }
}

and I'm performing this search through the tire gem:

{
   "query":{
      "query_string":{
         "query":"xyz",
         "default_operator":"AND"
      }
   },
   "sort":[
      {
         "count":"desc"
      }
   ],
   "filter":{
      "term":{
         "active":true,
         "_type":null
      }
   },
   "highlight":{
      "fields":{
         "name":{

         }
      },
      "pre_tags":[
         "<strong>"
      ],
      "post_tags":[
         "</strong>"
      ]
   }
}

I have two posts that should match, named 'xyz post' and 'xyz question'. When I perform this search, I get the highlighted fields back properly:

<strong>xyz</strong> question
<strong>xyz</strong> post

Now here's the thing: as soon as I change min_gram to 1 in my index and reindex, the highlighted fields start coming back like this:

<strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
<strong>x</strong><strong>y</strong><strong>z</strong> questio<strong>xyz</strong>n

I simply cannot understand why.

Recommended Answer

Short Answer

You need to check your mapping and see whether you use the fast-vector-highlighter. Even then, you still need to be quite careful about your queries.

Assume a fresh instance of ES 0.20.4 running on localhost.

Building on top of your example, let's add explicit mappings. Note that I set up two sub-fields for the code field. Both use the same analyzer; the only difference is "term_vector":"with_positions_offsets" on code.ngram.

curl -X PUT localhost:9200/myindex -d '
{
  "settings" : {
    "index":{
      "number_of_replicas":0,
      "number_of_shards":1,
      "analysis":{
        "analyzer":{
          "default":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter":{
          "my_ngram":{
            "type":"nGram",
            "min_gram":1,
            "max_gram":20
          }
        }
      }
    }
  },
  "mappings" : {
    "product" : {
      "properties" : {
        "code" : {
          "type" : "multi_field",
          "fields" : {
            "code" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes"
            },
            "code.ngram" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes",
              "term_vector":"with_positions_offsets"
            }
          }
        }
      }
    }
  }
}'
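
As a quick way to check which fields in a mapping carry term vectors with positions and offsets (the setting that distinguishes code.ngram above), a small script can walk the mapping. This is only a sketch: the function name fields_with_offsets is hypothetical, and the mapping dict is copied by hand from the request above rather than fetched from a running cluster.

```python
FVH_TERM_VECTOR = "with_positions_offsets"

def fields_with_offsets(mapping):
    """Return the names of fields whose mapping sets
    term_vector to with_positions_offsets (hypothetical helper)."""
    found = []

    def walk(node):
        for name, spec in node.items():
            if not isinstance(spec, dict):
                continue
            if spec.get("term_vector") == FVH_TERM_VECTOR:
                found.append(name)
            # Descend into object properties and multi_field sub-fields.
            for key in ("properties", "fields"):
                child = spec.get(key)
                if isinstance(child, dict):
                    walk(child)

    walk(mapping)
    return found

# The "mappings" section from the index definition above, typed in by hand.
mapping = {
    "product": {
        "properties": {
            "code": {
                "type": "multi_field",
                "fields": {
                    "code": {
                        "type": "string",
                        "analyzer": "default",
                        "store": "yes",
                    },
                    "code.ngram": {
                        "type": "string",
                        "analyzer": "default",
                        "store": "yes",
                        "term_vector": "with_positions_offsets",
                    },
                },
            }
        }
    }
}

print(fields_with_offsets(mapping))  # ['code.ngram']
```

Only code.ngram is reported, which matches the intent of the mapping: the plain code sub-field has no term vectors.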

Index some data.

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy i7500"
}'

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy 5 Europa"
}'

curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy Mini"
}'

And now we can run queries.

curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "i"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code":{},
      "code.ngram":{}
    }
  }
}'

This yields two search hits:

# 1
...
"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ],
  "code" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ]
}
# 2
...
"fields" : {
  "code" : "Samsung Galaxy i7500"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy <em>i</em>7500" ],
  "code" : [ "Samsung Galaxy <em>i</em>7500" ]
}

Both the code and code.ngram fields were correctly highlighted this time. But things change quickly when a longer query is used:

curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "y m"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code":{},
      "code.ngram":{}
    }
  }
}'

This yields:

"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galax<em>y M</em>ini" ],
  "code" : [ "Samsung Galaxy Min<em>y M</em>i" ]
}

The code field is not highlighted correctly (similar to your case).
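
For comparison, the expected highlighting can be reproduced outside Elasticsearch. The sketch below is not how Elasticsearch's highlighters work internally; it is a plain case-insensitive substring search that wraps each hit by character offsets. The function name highlight is hypothetical, and it assumes ASCII text so that lowercasing does not shift offsets.

```python
def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap every case-insensitive occurrence of term in text
    with pre/post tags, using character offsets (hypothetical helper)."""
    lowered = text.lower()
    needle = term.lower()
    out = []
    pos = 0
    while needle:
        hit = lowered.find(needle, pos)
        if hit == -1:
            break
        end = hit + len(needle)
        out.append(text[pos:hit])             # untouched text before the hit
        out.append(pre + text[hit:end] + post)  # the hit, wrapped in tags
        pos = end
    out.append(text[pos:])                    # remainder after the last hit
    return "".join(out)

print(highlight("Samsung Galaxy Mini", "i"))
# Samsung Galaxy M<em>i</em>n<em>i</em>
print(highlight("Samsung Galaxy Mini", "y m"))
# Samsung Galax<em>y M</em>ini
```

The second result matches what code.ngram returned above, while the plain code field placed the tags at the wrong offset.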

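Note that it is important that a term query is used here instead of query_string: a term query is not analyzed, whereas query_string runs the query text through the analyzer, so with this ngram analyzer the query text itself is split into grams.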
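
To see why min_gram matters once text is run through the ngram analyzer, here is a minimal Python sketch of the grams produced for a single lowercased token. It is not Lucene's implementation: the function name ngrams is hypothetical, and the order of the output is not meant to match the order in which Lucene emits tokens.

```python
def ngrams(token, min_gram, max_gram):
    """Return all substrings of token with a length between
    min_gram and max_gram, inclusive (hypothetical helper)."""
    token = token.lower()
    grams = []
    longest = min(max_gram, len(token))
    for size in range(min_gram, longest + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams

print(ngrams("xyz", 2, 20))  # ['xy', 'yz', 'xyz']
print(ngrams("xyz", 1, 20))  # ['x', 'y', 'z', 'xy', 'yz', 'xyz']
```

With min_gram set to 1, the single characters x, y and z become terms of their own, which is consistent with the separately highlighted letters in the question's output.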
