Elasticsearch query_string doesn't search by word part
Problem description
I'm sending this request:
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query" : {
    "query_string" : {
      "query" : "\"*cor interface*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
and I get the correct result:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 5.421598,
    "hits": [
      {
        "_index": "process_test_3",
        "_type": "14",
        "_id": "141_dashboard_14",
        "_score": 5.421598,
        "_source": {
          "obj_type": "dashboard",
          "obj_id": "141",
          "title": "Cor Interface Monitoring"
        }
      }
    ]
  }
}
But when I search by part of a word, for example:
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query" : {
    "query_string" : {
      "query" : "\"*cor inter*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
I get no results back:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
What am I doing wrong?
This is because your title field has probably been analyzed by the standard analyzer (the default setting), so the title Cor Interface Monitoring has been tokenized as the three tokens cor, interface and monitoring.
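To make the mismatch concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: standard_like_tokens is a hypothetical helper that only approximates the standard analyzer by splitting on non-alphanumeric characters and lowercasing, which is enough for this title.

```python
import re


def standard_like_tokens(text):
    # Hypothetical stand-in for the standard analyzer:
    # split on non-alphanumeric characters and lowercase each piece.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


# Terms that end up in the index for the title field.
indexed_terms = set(standard_like_tokens("Cor Interface Monitoring"))

# Terms produced from the second search string.
query_terms = standard_like_tokens("cor inter")

for term in query_terms:
    # Whole terms are compared, so "inter" does not match "interface".
    print(term, term in indexed_terms)
```

Under these assumptions cor is found but inter is not, because only the whole token interface was indexed.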
To search for any substring of a word, you need to create a custom analyzer that uses the ngram token filter, so that the substrings of each of your tokens (between min_gram and max_gram characters long) are indexed as well.
You can create your index like this:
curl -XPUT localhost:9200/process_test_3 -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "14": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}'
Then you can reindex your data. The title Cor Interface Monitoring will now be tokenized as:
co, cor, or
in, int, inte, inter, interf, etc.
mo, mon, moni, etc.
so that your second search query will now return the document you expect, because the tokens cor and inter now match.
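The effect of the substring filter can be sketched in plain Python as well. This is only an illustration under stated assumptions: ngrams and substring_terms are hypothetical helpers, tokens are split on whitespace instead of using the real standard tokenizer, and the order in which Elasticsearch actually emits the grams may differ.

```python
def ngrams(token, min_gram=2, max_gram=15):
    # All substrings of token whose length is between min_gram and max_gram,
    # mirroring the min_gram/max_gram settings of the "substring" filter.
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])
    return grams


def substring_terms(text):
    # Lowercase, split on whitespace (a simplification), then n-gram each token.
    terms = set()
    for token in text.lower().split():
        terms.update(ngrams(token))
    return terms


indexed_terms = substring_terms("Cor Interface Monitoring")

print(ngrams("cor"))
print("cor" in indexed_terms, "inter" in indexed_terms)
```

With these settings both cor and inter are among the indexed terms, which is why the partial-word search can now find the document.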