Jaro-Winkler function: why does the same score match both very similar and very different words?


Question

I am using Jaro-Winkler fuzzy matching to match names.

I am trying to determine a cut-off for the similarity score. If the names are too different, I want to exclude them for manual review.

While anything below .4 seemed to be entirely different names, the .4 range seemed fairly similar.

But then I came across strange exceptions: some names in that range are entirely different, while others are only one or two letters off (see the example below).

Can someone explain the wide variation among matches within the same score range?

   Estrella       ANNELISE    0.42
   Arienna        IREANNA     0.43
   Tayvia         I TAYVIA    0.43
   Amanda         IZABEL      0.44
   Hunter         JOSHUA      0.44
   Ryder          CHARLES     0.45
   Luis           ELIZABETH   0.45
   Sebastian      JOSE        0.45
   Christopher    CHISTOPHE   0.46
   Genayunique    GENAY-UNI   0.46
   Andreeaonn     ADREEAONN   0.46
   Chistopher     CHRISTOPH   0.46
   Dazharicon     DAZHARION   0.46
   Jennavecia     JENNACVEC   0.46
   Valentiria     VALENTINA   0.46
   Abel           SAMMUEL     0.46
   Dezarea Marie  DEZAREA     0.47
   Alexander      ALEXZANDE   0.47

Recommended answer

The Jaro-Winkler distance formula is biased towards strings with a common beginning. For example, Valentina and Valentiria share the prefix "Valenti".
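To make the prefix bias concrete, here is a minimal sketch of the textbook Jaro and Jaro-Winkler similarity. The function names `jaro` and `jaro_winkler`, the scaling factor `p=0.1`, and the prefix cap `max_prefix=4` are the conventional defaults and are assumptions here; the implementation used in the question may differ in details such as case handling or rounding.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]. Comparison is case-sensitive as written."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order,
    # counted as half the number of out-of-order positions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix (capped)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# The shared prefix raises the score above the plain Jaro value.
print(jaro("valentina", "valentiria"))
print(jaro_winkler("valentina", "valentiria"))
```

The Winkler step only ever adds to the Jaro score, and only when the strings start with the same characters, so two pairs with the same Jaro score can end up ranked differently purely because of how they begin.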

It also has some not-so-intuitive "rules" (see Wikipedia).

You should probably first determine what kind of dissimilarity you are expecting, and then look for a suitable distance formula. For example, in writing, "angleworm" for "angelworm" is a very likely error, so the distance between the two strings ought to be low. Mismatching "there" and "three" is less likely, and "ether" even less so. With longer anagrams, the Jaro distance might be exactly the same, and even the Winkler correction might not kick in.
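As one illustration of choosing a measure to fit the expected errors, the sketch below implements the optimal string alignment distance (a restricted Damerau-Levenshtein variant), which counts an adjacent swap as a single edit. This is a technique swapped in for illustration and is not named in the answer; the function name `osa_distance` is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions
    and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "le" <-> "el".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("angleworm", "angelworm"))  # 1: "le" and "el" are swapped
print(osa_distance("there", "three"))          # 1: "er" and "re" are swapped
print(osa_distance("ether", "there"))          # 2: drop the leading "e", append "e"
```

Note that this measure scores "there"/"three" the same as "angleworm"/"angelworm", so it fits data where adjacent swaps are the dominant typo. Whether that holds for a given set of names is something to check against the data.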

As you can read on this page (emphasis mine):

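In addition to optimizing for empty strings and exact matches, you can see here that I weight the first character even more heavily. This is because my data is heavy on initials.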

To compensate for the frequent use of middle initials, I count the Jaro-Winkler distance as 80% of the score, while the remaining 20% is based entirely on whether the first characters match. The value of p here was determined by heavy experimentation and hair pulling. Before I made this extension, initials would frequently align incorrectly.
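The 80/20 weighting described in the quote could be sketched as follows. The function name `blended_score` and its parameters are hypothetical, and the base similarity is passed in as a callable so that any Jaro-Winkler implementation can be plugged in; the quoted author's actual code is not shown in the source.

```python
from typing import Callable


def blended_score(
    a: str,
    b: str,
    base_similarity: Callable[[str, str], float],
    base_weight: float = 0.8,
) -> float:
    """Weighted mix of a base similarity and a first-character match.

    With base_weight=0.8, the base similarity contributes 80% of the score
    and the first-character comparison contributes the remaining 20%.
    """
    # 1.0 if both strings are non-empty and start with the same character.
    first_char_match = 1.0 if a and b and a[0] == b[0] else 0.0
    return base_weight * base_similarity(a, b) + (1 - base_weight) * first_char_match


# Usage with a placeholder similarity; substitute a real Jaro-Winkler function.
def exact_match(x: str, y: str) -> float:
    return 1.0 if x == y else 0.0


print(blended_score("J Smith", "John Smith", exact_match))  # first letters agree
print(blended_score("Smith J", "John Smith", exact_match))  # first letters differ
```

The first-character comparison here is case-sensitive, which is an assumption; normalizing case before scoring may be appropriate for name data.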
