Elasticsearch fuzzy matching: max_expansions & min_similarity


Question


I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.

As I understand it, min_similarity is the percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search should be executed. If it actually were the Levenshtein distance, it would be the ideal solution for me. In any case, it isn't working as I expect. For example, I have the word "Samvel":

queryStr      max_expansions            matches?
samvel        0                         Error: should not be 0 (but Levenshtein distance can be 0!)
samvel        1                         Yes
samvvel       1                         Yes
samvvell      1                         Yes (but it shouldn't have)
samvelll      1                         Yes (but it shouldn't have)
saamvelll     1                         No (but for some weird reason it matches Samvelian)
saamvelll     anything bigger than 1    No

The documentation says something I actually do not understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

Can anyone please explain to me exactly how these parameters affect the search results?

Solution

The min_similarity is a value between zero and one. From the Lucene docs:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The 'edit distance' that is referred to is the Levenshtein distance.
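
To make the quoted rule concrete, here is a minimal Python sketch. It assumes the similarity is computed as 1 - distance / min(len(query), len(term)); that formula agrees with the quoted rule for terms of the same length, but it is an assumption on my part and not something the quote states. The function names `levenshtein`, `similarity` and `is_similar` are illustrative only, not part of any Lucene or Elasticsearch API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(query: str, term: str) -> float:
    """ASSUMED formula: 1 - distance / length of the shorter string."""
    shorter = min(len(query), len(term))
    if shorter == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shorter


def is_similar(query: str, term: str, min_similarity: float = 0.5) -> bool:
    """For equal lengths this is the quoted rule: distance < length * 0.5."""
    return similarity(query, term) > min_similarity
```

Under this assumption, "samvvel" is one edit away from "samvel" and passes a min_similarity of 0.5 easily, while a six-letter term that differs from the query in three positions does not.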

The way this query works internally is:

  • it finds all terms that exist in the index that could match the search term, when taking the min_similarity into account
  • then it searches for all of those terms.
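
The two steps above can be sketched in Python against a toy inverted index. This is only an illustration of the idea, not how Lucene implements it: the dictionary-based index, the name `fuzzy_search`, and the similarity formula (1 - distance / length of the shorter string) are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(query: str, term: str) -> float:
    """ASSUMED formula, see the lead-in."""
    shorter = min(len(query), len(term))
    if shorter == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shorter


def fuzzy_search(query, inverted_index, min_similarity=0.5):
    """inverted_index maps each indexed term to the set of doc ids containing it."""
    # Step 1: walk every term in the index and keep the ones similar enough.
    matching_terms = [t for t in inverted_index
                      if similarity(query, t) > min_similarity]
    # Step 2: search for all of those terms (a union of their postings).
    doc_ids = set()
    for term in matching_terms:
        doc_ids |= inverted_index[term]
    return sorted(matching_terms), sorted(doc_ids)


# Hypothetical index contents, for illustration only.
index = {"samvel": {1, 2}, "samuel": {3}, "samvelian": {4}, "daniel": {5}}
terms, docs = fuzzy_search("samvvel", index)
```

Step 1 touches every term in the index, which is where the cost comes from.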

You can imagine how heavy this query could be!

To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
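
Here is a small Python sketch of what such a cap does to a list of candidate terms. Note that it limits how many terms are kept, not how many edits are allowed. Which terms survive the cap is an assumption in this sketch (it keeps the most similar ones); the function name `cap_expansions` and the scores are made up for illustration.

```python
def cap_expansions(scored_terms, max_expansions=None):
    """Keep at most max_expansions candidate terms.

    scored_terms is a list of (term, similarity) pairs.
    ASSUMPTION: the most similar terms are kept; ties are broken alphabetically.
    max_expansions=None models the unbounded default.
    """
    ranked = sorted(scored_terms, key=lambda pair: (-pair[1], pair[0]))
    if max_expansions is not None:
        ranked = ranked[:max_expansions]
    return [term for term, _ in ranked]


# Hypothetical candidates with made-up similarity scores.
candidates = [("samuel", 0.67), ("samvel", 0.83), ("samvell", 0.71)]
only_one = cap_expansions(candidates, max_expansions=1)
everything = cap_expansions(candidates)
```

With max_expansions set to 1, only a single term from the index is searched for, no matter how many other terms would have passed min_similarity.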
