ElasticSearch's Fuzzy Query

Problem description

I am brand new to ElasticSearch and am currently exploring its features. One that interests me is the Fuzzy Query, which I am testing and having trouble using. It is probably a naive question, so I guess someone who has already used this feature will quickly find the answer, at least I hope. :)

BTW I have the feeling that it might not be only related to ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "firstindex" in which I store an object whose "label" has the value "american football". This is the request I use.

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}
'

This is the result I get.

{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}

So far so good. Now I want to find this entry using a fuzzy query. This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }                       
    }    
   }   
}
'

And this is the result I get:

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, there is no hit. But when I shorten my query's value slightly, from "american football" to "american footb", like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d ' {
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'

Then I get a correct hit on my entry. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301, "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}


So, I have several questions related to this test:

  1. Why didn't I get any result when performing a query with a value exactly equal to my only entry, "american football"?

  2. Is it related to the fact that I have a multi-word value?

  3. Is there a way to get the "similarity" score in my query result, so I can better understand how to find the right threshold for my fuzzy queries?

  4. There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the potential parameters I can use for the fuzzy query. Where could I find such an exhaustive list?

  5. Same question for the other queries, actually.

  6. Is there a difference between a Fuzzy Query and a Query String Query using Lucene syntax to get fuzzy matching?

Solution

1.

The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the query text. So, in your example, elasticsearch tries to match the single term "american football" against the indexed term "american" and against the indexed term "football". The match between terms is based on Levenshtein distance, which is used to calculate the similarity score. Since you have min_similarity=0.0, any term should match any other term as long as the edit distance is smaller than the length of the shorter term.

In your case, the term "american football" has length 17 and the term "american" has length 8. The distance between these two terms is 9, which is bigger than the length of the shorter term, 8, so this term is rejected. The edit distance between "american footb" and "american" is 6: it's basically the term "american" with 6 additions at the end. That's why it produces results. With min_similarity=0.0, pretty much anything with edit distance 7 or less will match. You will even get results while searching for "aqqqqqq", for example.
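
To make the arithmetic above easier to check, here is a minimal Python sketch of the comparison. It is not the elasticsearch or Lucene code. The `levenshtein` and `similarity` helpers are hypothetical names, and the formula `1 - distance / length of the shorter term` is an assumption chosen so that "similarity above 0.0" means the same thing as "edit distance smaller than the shorter term", which is the rule described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    """Assumed scoring rule: 1 - distance / length of the shorter term."""
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 0.0
    return 1.0 - levenshtein(query, term) / shortest


if __name__ == "__main__":
    # The query value is treated as ONE term (it is not analyzed), while the
    # indexed label was split into the terms "american" and "football".
    indexed_terms = ["american", "football"]
    for query in ["american football", "american footb"]:
        for term in indexed_terms:
            dist = levenshtein(query, term)
            sim = similarity(query, term)
            verdict = "match" if sim > 0.0 else "rejected"
            print(f"{query!r} vs {term!r}: distance={dist}, similarity={sim:.3f}, {verdict}")
```

Under this assumed formula, "american football" against "american" gives distance 9 and a negative similarity, so it is rejected, while "american footb" against "american" gives distance 6 and similarity 0.25, so it matches with min_similarity=0.0.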

2.

Yes, as I explained above, it is related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This Query and the fuzziness parameter of the Text Query.

4 & 5.

Usually, the next best source of information after elasticsearch.org is the elasticsearch source code.
