Fuzzy matching and iteration through DataFrame
Question
I have these two DataFrames:
import pandas as pd
from fuzzywuzzy import fuzz  # assumption: the question does not show its imports; rapidfuzz also provides fuzz.ratio

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice'],
        "Age": ["20","18","62","73"],
        "Studies": ['Economics','Maths','Psychology','Medical']
        }
dico2 = {'Surname': ['Henri2','Arthur1','Patrice4','Lisiane3']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
I want to fuzzy match the Surname strings to the corresponding Names to get the following output:
      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Patrice4  73     Medical
Here is my code so far:
dico['Surname'] = []
for i in dico2:
    lst = [0, 0, 0]
    for j in dico:
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    dico['Surname'].append(i)
But I get ValueError: Length of values (0) does not match length of index (4), and I don't understand why. Thanks!
Recommended answer
The error comes from dico['Surname'] = []. The column dico['Surname'] must have length 4 (one value per row), while [] has length 0. You can instead collect your surnames in a list and then add them to the DataFrame in one go after the loop.
You also need to tell the outer loop to iterate over dico2['Surname'] instead of the entire DataFrame. Iterating over a DataFrame directly yields its column labels, not the values in its rows.
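As a quick check of both points, here is a minimal sketch using cut-down versions of the question's DataFrames. The variable names labels, values and error are introduced only for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

dico = pd.DataFrame({"Name": ["Arthur", "Henri", "Lisiane", "Patrice"]})
dico2 = pd.DataFrame({"Surname": ["Henri2", "Arthur1", "Patrice4", "Lisiane3"]})

# Iterating over a DataFrame yields its column labels, not its values.
labels = [i for i in dico2]

# Iterating over a single column (a Series) yields the values themselves.
values = [i for i in dico2["Surname"]]

# Assigning a list as a new column requires one value per row,
# so an empty list is rejected for a 4-row DataFrame.
try:
    dico["Surname"] = []
    error = None
except ValueError as exc:
    error = str(exc)

print(labels)
print(values)
print(error)
```

The loop below applies both changes.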
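# Collect the surnames in a plain list instead of assigning an empty column
surnames = []
for i in dico2['Surname']:
    lst = [0, 0, 0]  # [best score so far, i, j]
    for j in dico:
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    surnames.append(i)
# Add the whole column in one go; the list now has one entry per row
dico['Surname'] = surnames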
This only fixes the error in question. Also see maxbachmann's advice on not calling fuzz.ratio twice.
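Note that the loop above still compares each surname against the column labels of dico and appends the surnames in dico2's order, so it does not yet line each Surname up with its Name. A minimal sketch of one way to do that, computing each score only once, might look like the following. It is not the answerer's code: it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio so that it runs without third-party packages, and the helper names ratio and best_match are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity score from 0 to 100; a stand-in for fuzz.ratio (assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query, choices, scorer=ratio):
    """Return (choice, score) for the choice most similar to query."""
    best_choice, best_score = None, -1
    for choice in choices:
        score = scorer(query, choice)  # computed once per pair and reused
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score


names = ["Arthur", "Henri", "Lisiane", "Patrice"]
surnames = ["Henri2", "Arthur1", "Patrice4", "Lisiane3"]

# One surname per name, in the same order as the Name column.
matched = [best_match(name, surnames)[0] for name in names]
print(matched)
```

Because matched has one entry per name, it could then be assigned with dico['Surname'] = matched. If fuzzywuzzy or rapidfuzz is installed, passing scorer=fuzz.ratio should use the original scorer instead of the difflib stand-in.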