Jaro-Winkler和Levenshtein距离之间的区别? [英] Difference between Jaro-Winkler and Levenshtein distance?

查看:871
本文介绍了Jaro-Winkler和Levenshtein距离之间的区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个用例,需要对多个文件中的数百万条记录进行模糊匹配。我为此确定了两种算法: Jaro-Winkler Levenshtein 编辑距离。

I have a use case where I need to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.

当我开始探索两者时,我无法理解两者之间的确切区别。看起来Levenshtein给出了两个字符串之间的编辑次数,而Jaro-Winkler给出了0.0到1.0之间的匹配分数。我不了解该算法。由于我需要使用任何一种算法,因此我需要知道在算法性能方面的确切差异。

When I started exploring both, I was not able to understand what the exact difference is between the two. It seems Levenshtein gives the number of edits between two strings, and Jaro-Winkler gives a matching score between 0.0 to 1.0. I didn't understand the algorithm. As I need to use either algorithm, I need to know the exact differences with respect to algorithm performance.

推荐答案

Levenshtein会计算数字将一个字符串转换为另一字符串所需的编辑(插入,删除或替换)操作。 Damerau-Levenshtein是修改后的版本,也将换位视为单个编辑。尽管输出是编辑的整数,但是可以通过公式

Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula

1 - (edit distance / length of the larger of the two strings)

Jaro算法是常用字符的量度,不超过较长字符串的长度的一半(考虑换位)。 Winkler修改了此算法,以支持以下观点:字符串开头附近的差异比字符串结尾附近的差异更重要。 Jaro和Jaro-Winkler适合比较诸如单词和名称之类的较小字符串。

The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.

决定使用哪种字符不仅取决于性能。选择适合您所比较字符串性质的方法很重要。但总的来说,您提到的两种算法都可能很昂贵,因为每个字符串都必须与其他每个字符串进行比较,并且数据集中包含数百万个字符串,因此比较的数量很多。这比诸如为每个字符串计算语音编码,然后简单地将共享相同编码的字符串分组等要昂贵得多。

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.

这些算法有大量详细信息以及互联网上的其他模糊字符串匹配算法。这将为您提供一个开始:

There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:

人名
匹配的比较:技巧和实用
问题

根据该论文,我提到的四种Jaro和Levenshtein算法的速度从最快到最慢:

According to that paper, the speed of the four Jaro and Levenshtein algorithms I've mentioned are from fastest to slowest:


  • Jaro

  • Jaro-Winkler

  • Levenshtein

  • Damerau-Levenshtein

  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Damerau-Levenshtein

是最快的2到3倍。当然,这些时间取决于字符串的长度和实现,并且有一些方法可以优化可能没有使用的算法。

with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.

这篇关于Jaro-Winkler和Levenshtein距离之间的区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆