Python fuzzy matching of names with only first initials


Question

I need to match names extracted from a given string against a database of names. Below is a very simple example of the issue I am running into, and I am unclear as to why one case works while the other does not. If I'm not mistaken, the default algorithm behind extractOne() is the Levenshtein distance algorithm. Is it because the Clemens names provide the first two letters of the first name, as opposed to only one in the Gonzalez case?

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s = ['Gonzalez, E. walked down the street.', 'Gonzalez, R. went to the market.', 'Clemens, Ko. reach the intersection; Clemens, Ka. did not.']

names = []

for i in s:

    name = [] #clear name
    for k in i.split():
        if k[0].isupper(): name.append(k)
        else: break
    names.append(' '.join(name))

    if ';' in i:
        for each in i.split(';')[1:]:
            name = [] #clear name
            for k in each.split():
                if k[0].isupper(): name.append(k)
                else: break
            names.append(' '.join(name))

print(names)

choices = ['Kody Clemens','Kacy Clemens','Gonzalez Ryan', 'Gonzalez Eddy']

for i in names:
    s = process.extractOne(i, choices)
    print(s, i)

Output:

['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']
('Gonzalez Ryan', 85) Gonzalez, E.
('Gonzalez Ryan', 85) Gonzalez, R.
('Kody Clemens', 86) Clemens, Ko.
('Kacy Clemens', 86) Clemens, Ka.

Recommended answer

Although @Igle's comment does solve this specific problem, I want to stress that it is a narrow solution that won't necessarily work for everything. Fuzzywuzzy has multiple scorers that combine the Levenshtein distance algorithm with different logic to compare strings. The default scorer, fuzz.WRatio, compares the score of the plain Levenshtein-based ratio (fuzz.ratio) with the scores of other variants and returns the best match across all of the scorers. There's more to it than that, including additional logic for weighting the scores from the different methods. If you're interested, I suggest looking at the source code for fuzz.WRatio.

To see what's happening in your case, you can compare the scores for all the choices across scorers by slightly adapting the last lines of your code:

For token_set_ratio:

for i in names:
   s = process.extract(i, choices,scorer=fuzz.token_set_ratio)
   print(s, i)

[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.

For token_sort_ratio:

for i in names:
   s = process.extract(i, choices,scorer=fuzz.token_sort_ratio)
   print(s, i)

[('Gonzalez Eddy', 87), ('Gonzalez Ryan', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 87), ('Gonzalez Eddy', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.

Although token_sort_ratio shows a clear winner, token_set_ratio returns higher scores, and the higher score is what fuzz.WRatio uses to pick the result it returns. Another major issue is that when the queries and choices are this similar, the order in which they are compared starts to matter. For example, when I run the exact same code as above but reverse the order of the choices list, we get 'Gonzalez Eddy' for both Gonzalez queries:

for i in names:
   s = process.extract(i, choices[::-1],scorer=fuzz.token_set_ratio)
   print(s, i)

[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, E.
[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Eddy', 26), ('Gonzalez Ryan', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.
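The order dependence can be reproduced without the library. This is a minimal sketch that assumes the winner is chosen the way Python's built-in max chooses it, by keeping the first of several equal maxima. That assumption is consistent with the outputs above but is not taken from the fuzzywuzzy source. The scores are copied from the token_set_ratio output for 'Gonzalez, E.'.

```python
# Scores copied from the token_set_ratio output above for the query 'Gonzalez, E.'
scores = {
    'Gonzalez Ryan': 89,
    'Gonzalez Eddy': 89,
    'Kody Clemens': 27,
    'Kacy Clemens': 27,
}
choices = ['Kody Clemens', 'Kacy Clemens', 'Gonzalez Ryan', 'Gonzalez Eddy']

# max() returns the first item that reaches the highest score,
# so a tie is settled purely by position in the list.
winner_original = max(choices, key=scores.get)
winner_reversed = max(choices[::-1], key=scores.get)

print(winner_original)
print(winner_reversed)
```

With tied scores, the result says nothing about which candidate is the better match, only which one was listed first.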

I'm guessing that the correct match actually has a higher score, but that 'Eddy' and 'Ryan' are close enough for both to round to the same final score.
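The guess about rounding can be checked by hand. As far as I can tell from the fuzzywuzzy source, token_set_ratio builds three strings (the shared tokens, the shared tokens plus the rest of the query, and the shared tokens plus the rest of the choice), scores the three pairs, and returns the highest. The sketch below reproduces that with difflib.SequenceMatcher standing in for fuzz.ratio. The helper token_set_pairs is hypothetical, and the inputs are assumed to be already lowercased and stripped of punctuation.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale, unrounded (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def token_set_pairs(query, choice):
    """Return the three pairwise scores a token-set comparison is built from."""
    q, c = set(query.split()), set(choice.split())
    common = " ".join(sorted(q & c))
    q_full = (common + " " + " ".join(sorted(q - c))).strip()
    c_full = (common + " " + " ".join(sorted(c - q))).strip()
    return {
        "common vs query": ratio(common, q_full),
        "common vs choice": ratio(common, c_full),
        "query vs choice": ratio(q_full, c_full),
    }


eddy = token_set_pairs("gonzalez e", "gonzalez eddy")
ryan = token_set_pairs("gonzalez e", "gonzalez ryan")

for label, pairs in (("eddy", eddy), ("ryan", ryan)):
    print(label, {k: round(v, 2) for k, v in pairs.items()}, "max:", round(max(pairs.values()), 2))
```

If this reading of the scorer is right, the highest score for both choices comes from "common vs query" ('gonzalez' against 'gonzalez e'), which never looks at the first name in the choice. The two scores would then be identical before rounding, so removing the rounding alone would not separate 'Eddy' from 'Ryan' under token_set_ratio.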

Ways I've dealt with similar issues in the past:

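  1. Use extract instead of extractOne (as I did in the examples above).
  2. Run the same query and choices through multiple scorers (ratio, token_set_ratio, token_sort_ratio) and use a weighted average of those scores to pick the best match.
  3. Adapt the Fuzzywuzzy source code to incorporate custom weighting or to remove the rounding.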
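The second approach in the list above could be sketched roughly as follows. To keep the example self-contained it does not call fuzzywuzzy: ratio, token_sort and token_set are simplified stand-ins built on difflib.SequenceMatcher, so their scores will not exactly match the library's. The weights, normalize and best_match are hypothetical choices made for illustration only.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and turn runs of non-alphanumerics into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ratio(a, b):
    """Unrounded 0-100 similarity (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def token_sort(a, b):
    """Compare the strings after sorting their tokens alphabetically."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set(a, b):
    """Best of three comparisons built from shared and unshared tokens."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    a_full = (common + " " + " ".join(sorted(ta - tb))).strip()
    b_full = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, a_full), ratio(common, b_full), ratio(a_full, b_full))


# Illustrative weights only: token_sort counts double because it was the
# scorer that separated the two Gonzalez candidates in the outputs above.
WEIGHTS = [(ratio, 1.0), (token_sort, 2.0), (token_set, 1.0)]


def best_match(query, choices):
    """Return the choice with the highest weighted-average score."""
    q = normalize(query)
    total = sum(weight for _, weight in WEIGHTS)

    def combined(choice):
        c = normalize(choice)
        return sum(weight * scorer(q, c) for scorer, weight in WEIGHTS) / total

    return max(choices, key=combined)


choices = ['Kody Clemens', 'Kacy Clemens', 'Gonzalez Ryan', 'Gonzalez Eddy']
for name in ['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']:
    print(name, '->', best_match(name, choices))
```

Averaging keeps a single scorer's tie from deciding the outcome, but the weights need tuning against real data, and genuinely ambiguous inputs (for example two people sharing a surname and first initial) will still tie.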
