Should I include spaces in fuzzy query fields?


Problem Description



I have this data:

name:
  first: 'John'
  last: 'Smith'

When I store it in ES, AFAICT it's better to make it one field. However, should this one field be:

name: 'John Smith'

or

name: 'JohnSmith'

?

I'm thinking that the query should be:

query: 
  match: 
    name: 
      query: searchTerm
      fuzziness: 'AUTO'
      operator: 'and'

Example search terms are what people might type in a search box, like

John
Jhon Smi
J Smith
Smith

etc.

Solution

You will probably want a combination of ngrams and a fuzzy match query. I wrote a blog post about ngrams for Qbox if you need a primer: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch. I'll swipe the starter code at the end of the post to illustrate what I mean here.

Also, I don't think it matters much whether you use two fields for name, or just one. If you have some other reason you want two fields, you may want to use the _all field in your query. For simplicity I'll just use a single field here.

Here is a mapping that will get you the partial-word matching you want, assuming you only care about tokens that start at the beginning of words (otherwise use ngrams instead of edge ngrams). There are lots of nuances to using ngrams, so I'll refer you to the documentation and my primer if you want more info.

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "edge_ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 1,
               "max_gram": 10
            }
         },
         "analyzer": {
            "edge_ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "edge_ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "name": {
               "type": "string",
               "index_analyzer": "edge_ngram_analyzer",
               "search_analyzer": "standard"
            }
         }
      }
   }
}

One thing to note here, in particular: "min_gram": 1. This means that single-character tokens will be generated from indexed values. This will cast a pretty wide net when you query (lots of words begin with "j", for example), so you may get some unexpected results, especially when combined with fuzziness. But this is needed to get your "J Smith" query to work right. So there are some trade-offs to consider.
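To see concretely what that analyzer chain does at index time, here's a rough Python sketch (my own approximation, not the actual Lucene implementation) of standard tokenization + lowercasing + the edge_ngram_filter with the min_gram/max_gram settings from the mapping:

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the edge_ngram_analyzer above: split on whitespace,
    lowercase, then emit prefixes of each token from min_gram up to max_gram."""
    grams = []
    for token in text.lower().split():
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:n])
    return grams

print(edge_ngrams("John Smith"))
# ['j', 'jo', 'joh', 'john', 's', 'sm', 'smi', 'smit', 'smith']
```

Because "j" and "smi" are stored as index terms, a search-time query analyzed with the plain standard analyzer can hit them directly; that's what makes the "J Smith" and "Jhon Smi" searches possible. You can also see the flip side: the single-character "j" term is shared by every name starting with j.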

For illustration, I indexed four documents:

PUT /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"John Hancock"}
{"index":{"_id":2}}
{"name":"John Smith"}
{"index":{"_id":3}}
{"name":"Bob Smith"}
{"index":{"_id":4}}
{"name":"Bob Jones"}

Your query mostly works, with a couple of caveats.

POST /test_index/_search
{
    "query": {
        "match": {
           "name": {
               "query": "John",
               "fuzziness": "AUTO",
               "operator": "and"
           }
        }
    }
}

This query returns three documents, because of ngrams plus fuzziness:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0.90169895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.90169895,
            "_source": {
               "name": "John Hancock"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.90169895,
            "_source": {
               "name": "John Smith"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.6235822,
            "_source": {
               "name": "Bob Jones"
            }
         }
      ]
   }
}

That may not be what you want. Also, "AUTO" doesn't work with the "Jhon Smi" query, because "Jhon" is an edit distance of 2 from "John", and "AUTO" uses an edit distance of 1 for strings of 3-5 characters (see the docs for more info). So I have to use this query instead:
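The edit-distance claim is easy to check with a quick sketch of classic Levenshtein distance (insertions, deletions, substitutions; no transpositions), which is consistent with the distance-2 behavior described above. This is just an illustration, not Elasticsearch's actual automaton-based implementation:

```python
def levenshtein(a, b):
    """Classic Levenshtein distance via dynamic programming:
    minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("jhon", "john"))  # 2: the h/o swap costs two substitutions
```

Since "jhon" is a 4-character term and "AUTO" only allows an edit distance of 1 for terms of 3-5 characters, the "AUTO" query misses it; setting "fuzziness": 2 explicitly is what makes it match.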

POST /test_index/_search
{
    "query": {
        "match": {
           "name": {
               "query": "Jhon Smi",
               "fuzziness": 2,
               "operator": "and"
           }
        }
    }
}
...
{
   "took": 17,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.4219328,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 1.4219328,
            "_source": {
               "name": "John Smith"
            }
         }
      ]
   }
}

The other queries work as expected. So this solution isn't perfect, but it will get you close.

Here's all the code I used:

http://sense.qbox.io/gist/ba5a6741090fd40c1bb20f5d36f3513b4b55ac77
