使用Levenshtein距离匹配的百分比匹配排名 [英] Percentage rank of matches using Levenshtein Distance matching

查看:106
本文介绍了使用Levenshtein距离匹配的百分比匹配排名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用Levenshtein距离算法将单个搜索词与可能匹配的字典进行匹配。该算法返回的距离表示为将搜索字符串转换为匹配字符串所需的操作数。
我想将结果显示在排名靠前的 N(例如10)匹配项的百分比列表中。

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as number of operations required to convert the search string into the matched string. I want to present the results in ranked percentage list of top "N" (say 10) matches.

由于搜索字符串可以长于或短于单个字典字符串,因此以百分比表示距离的合适逻辑将定性地表明距离有多近%是查询字符串的每个结果,其中100%表示完全匹配。

Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate logic for expressing the distance as a percentage, which would qualitatively refelct how close "as a percentage" is each result to the query string, with 100% indicating an exact match.

我考虑了以下选项:

Q = query string
M = matched string
PM = Percentage Match
Option 1. PMi = (1 - Lev_distance(Q, Mi)/Strlen(Q)) * 100
Option 2. PMi = (1 - Lev_distance(Q, Mi)/max(Strlen(Q), strlen(Mi))) * 100

在距离大于搜索字符串长度的情况下,如果匹配字符串为长。例如,查询 ABC与 ABC Corp.匹配

Option 1 has the possibility of negative percentages in case the distance is greater than the search string length, where the match string is long. For example query "ABC" matched with "ABC Corp." would result in a negative match percent.

选项2在整个Mi集中似乎没有给出一致的百分比,因为每次计算都可能使用不同的分母,因此

Option 2 does not appear to give a consistent percentage across a set of Mi, as each calculation would possibly use a different denominator and hence the resulting percentage values would not be normalized.

我只能想到的另一种方法是放弃将lev_distance与任一字符串长度的比较,而只给出比较距离前N个匹配项的百分率排名倒数(100%)。

Only other way I can think of is to ditch the comparison of the lev_distance to either string lenghts, but instead present the comparitive distances of the top "N" matches as an inverse percentile rank (100-percentile-rank).

有什么想法?有更好的方法吗?我必须丢失一些东西,因为Levenshtein距离可能是模糊匹配最常用的算法,而且这肯定是一个非常常见的问题。

Any thoughts? Are there better approaches? I must be missing something as Levenshtein distance is probably the most common algorithm for fuzzy matches and this must be a very common problem.

推荐答案

我遇到了类似的问题,该线程帮助我找到了解决方案。希望它也能帮助其他人。

I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.

int levDis = Lev_distance(Q, Mi)
int bigger = max(strlen(Q), strlen(Mi))
double pct = (bigger - levDis) / bigger

如果两个字符串完全相同,则返回100%,如果完全不同,则返回0%。

It should return 100% if both strings are exactly the same and 0% if they are totaly different.

(对不起,如果我的英语不太好)

(sorry if my english isn't that good)

这篇关于使用Levenshtein距离匹配的百分比匹配排名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆