ElasticSearch's Fuzzy Query


Question

I am brand new to ElasticSearch and am currently exploring its features. One that interests me is the Fuzzy Query, which I am testing and having trouble using. It is probably a dumb question, so I guess someone who has already used this feature will quickly find the answer, at least I hope so. :)

BTW, I have the feeling that it might not be related only to ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "firstindex", in which I store a "node" object whose "label" is "american football". This is the request I use:

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}
'

This is the result I get:

{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}

So far so good. Now I want to find this entry using a fuzzy query. This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'

This is the result I get:

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, there are no hits. But when I shorten my query's value a bit, from "american football" to "american footb", like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d ' {
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'

Then I get a correct hit on my entry. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301, "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}

So, I have several questions related to this test:

1. Why do I get no result when I run a query whose value is exactly equal to my only entry, "american football"?

2. Is it related to the fact that I have a multi-word value?

3. Is there a way to get the "similarity" score in my query result, so I can better understand how to find the right threshold for my fuzzy queries?

4. There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the potential parameters I can use for the fuzzy query. Where could I find such an exhaustive list? Same question for the other queries, actually.

5. Is there a difference between a Fuzzy Query and a Query String Query using Lucene syntax to get fuzzy matching?

Recommended answer

1. The fuzzy query operates on terms. It cannot handle phrases because it does not analyze the query text. So, in your example, elasticsearch tries to match the single term "american football" against the indexed terms "american" and "football". The match between terms is based on Levenshtein distance, which is used to calculate the similarity score. Since you have min_similarity=0.0, any term should match any other term as long as the edit distance is smaller than the length of the shorter term. In your case, the term "american football" has length 17 and the term "american" has length 8. The distance between these two terms is 9, which is bigger than 8, the length of the shorter term, so the term is rejected. (The term "football" is also 9 edits away, so it is rejected for the same reason.) The edit distance between "american footb" and "american" is 6: it is basically the term "american" with 6 characters added at the end. That's why it produces results. With min_similarity=0.0, pretty much anything with an edit distance of 7 or less will match. You will even get results when searching for "aqqqqqq", for example.
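The arithmetic in this explanation can be checked with a small sketch. It is only an illustration of the rule as worded above (accept when the edit distance is smaller than the length of the shorter term), not Lucene's actual implementation; the function names `levenshtein` and `fuzzy_accepts` are hypothetical helpers introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_accepts(query_term: str, indexed_term: str) -> bool:
    """Simplified min_similarity=0.0 rule from the answer (an assumption,
    not Lucene's code): the distance must be below the shorter term's length."""
    shortest = min(len(query_term), len(indexed_term))
    return levenshtein(query_term, indexed_term) < shortest


# Terms assumed to be in the index after "american football" was analyzed.
indexed_terms = ["american", "football"]

for query in ("american football", "american footb"):
    for term in indexed_terms:
        print(f"{query!r} vs {term!r}: "
              f"distance={levenshtein(query, term)}, "
              f"accepted={fuzzy_accepts(query, term)}")
```

Under this simplified rule, only the pair "american footb" / "american" is accepted, which is consistent with the single hit in the question.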

2. Yes, as I explained above, it is somewhat related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This Query and at the fuzziness parameter of the Text Query.

4. Usually, the next best source of information after elasticsearch.org is the elasticsearch source code.
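For the multi-term case, the answer points to the fuzziness parameter of the Text Query. A rough sketch of what such a request body might look like is given below. The body shape (`text` wrapping a field object with `query` and `fuzziness`) is an assumption based on the Text Query of that era and should be checked against the documentation for your version; `build_text_query` is a hypothetical helper, and the value 0.5 is an arbitrary example.

```python
import json


def build_text_query(field: str, text: str, fuzziness: float) -> dict:
    """Build a search body for a text query with fuzziness.

    Assumed shape, not verified against a running cluster: the query text
    is analyzed into terms and each term is matched fuzzily.
    """
    return {
        "query": {
            "text": {
                field: {
                    "query": text,           # analyzed, so multi-word input is split into terms
                    "fuzziness": fuzziness,  # assumed to range from 0.0 to 1.0 for string fields
                }
            }
        }
    }


body = build_text_query("label", "american footbal", 0.5)
# The printed JSON could be passed to curl with -d, as in the question.
print(json.dumps(body, indent=2))
```

Unlike the fuzzy query, this kind of query analyzes its input first, so each word is compared against the indexed terms separately.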
