Fuzzy matching and iteration through DataFrame


Question

I have these two DataFrames:

import pandas as pd

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice'],
        "Age": ["20","18","62","73"],
        "Studies": ['Economics','Maths','Psychology','Medical']
        }
dico2 = {'Surname': ['Henri2','Arthur1','Patrice4','Lisiane3']}

# Turn both dictionaries into DataFrames (one column per key).
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)

I want to fuzzy match the Surname strings to the corresponding Names, to get an output like this:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Patrice4  73     Medical

Here is my code so far:

dico['Surname'] = []
for i in dico2:
    lst = [0, 0, 0]
    for j in dico:
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    dico['Surname'].append(i)

But I get a ValueError: Length of values (0) does not match length of index (4), and I don't understand why. Thanks!

Recommended answer

The error comes from this line:

dico['Surname'] = []

dico has 4 rows, so dico['Surname'] must have length 4, while [] has length 0. You can instead collect the surnames in a list and add them to the DataFrame in one go after the loop.

You also need to tell the outer loop to iterate over dico2['Surname'] instead of the entire DataFrame. Iterating over a DataFrame directly yields its column labels, not its values.
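Both points can be checked in isolation. The following is a minimal sketch, assuming a recent pandas version and using cut-down copies of the question's DataFrames; the variable names labels, values and error are only for illustration.

```python
import pandas as pd

dico = pd.DataFrame({'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice']})
dico2 = pd.DataFrame({'Surname': ['Henri2', 'Arthur1', 'Patrice4', 'Lisiane3']})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(dico2)

# Iterating over a single column yields the values stored in it.
values = list(dico2['Surname'])

# Assigning a list as a column requires one entry per row,
# so an empty list is rejected for a 4-row DataFrame.
try:
    dico['Surname'] = []
    error = None
except ValueError as exc:
    error = str(exc)

# A list with one entry per row is accepted.
dico['Surname'] = values
```

Applied to the loop from the question, the two changes give: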

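# fuzz is not imported in the question; it is assumed here to come from a
# fuzzy-matching library, e.g. `from rapidfuzz import fuzz`.
surnames = []
for i in dico2['Surname']:      # iterate over the values, not the column labels
    lst = [0, 0, 0]             # [best score, surname, matched label]
    for j in dico:              # unchanged from the question: yields column labels
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    surnames.append(i)          # unchanged from the question: keeps dico2's order

# Assign the whole column at once; the list has one entry per row.
dico['Surname'] = surnames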

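This only fixes the error in question. The inner loop still compares each surname with the column labels of dico, and the surnames are appended in dico2's order, so the rows are not yet aligned with the right names. Also see maxbachmann's advice on not calling fuzz.ratio twice.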
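To go one step further than the fix above, here is a rough sketch of the matching itself. It is not the answer author's code: it swaps fuzz.ratio for difflib.SequenceMatcher from the standard library so that it runs without extra packages, and the helper names similarity and best_matches are made up for illustration. It loops over the Name values rather than the column labels, scores each pair only once, and keeps the best-scoring surname for each name.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(names, surnames):
    """For each name, return the surname with the highest similarity score."""
    surnames = list(surnames)
    matched = []
    for name in names:
        best_score, best_surname = -1.0, None
        for surname in surnames:
            score = similarity(name, surname)  # computed once per pair
            if score > best_score:
                best_score, best_surname = score, surname
        matched.append(best_surname)
    return matched


names = ['Arthur', 'Henri', 'Lisiane', 'Patrice']
surnames = ['Henri2', 'Arthur1', 'Patrice4', 'Lisiane3']
result = best_matches(names, surnames)
```

With the DataFrames from the question, dico['Surname'] = best_matches(dico['Name'], dico2['Surname']) would fill the column in a single assignment. Nothing in this sketch stops two names from picking the same surname, so real data may need a score threshold or a one-to-one assignment step.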
