How to 'fuzzy' match strings when merging two dataframes in pandas
Question
I have two dataframes, df1 and df2.
import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})
I need to join these two dataframes with pandas.merge on the column Name. However, as you can see, there are some slight differences between the Name columns of the two dataframes. Let's assume they refer to the same people. If I simply do:
pd.merge(df1, df2, how='inner', on='Name')
I only get back the rows whose names match exactly ('John Weber' and 'Ian Ford' in this example); the other two are dropped.
Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column. I have no idea how to handle it. Thanks a lot in advance.
Recommended answer
I would use fuzzywuzzy here:
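from fuzzywuzzy import process

# For each name in df2, use the best-scoring name from df1 as the merge key.
# process.extract returns a list of matches; [0][0] is the top match's string.
df2['key'] = df2.Name.apply(lambda x: process.extract(x, df1.Name, limit=1)[0][0])
df2.merge(df1, left_on='key', right_on='Name')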
Out[1238]:
       Name_x gender         key  Age      Name_y
0  adam Smith      M  Adam Smith   43  Adam Smith
1   Annie Kim      F    Anne Kim   21    Anne Kim
2  John Weber      M  John Weber   55  John Weber
3    Ian Ford      M    Ian Ford   24    Ian Ford
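One thing to watch with this approach: taking the top match unconditionally pairs every name with something, even when no candidate is really close. Below is a minimal sketch of the same key-building step with a similarity cutoff. It is not the answer's method: it swaps fuzzywuzzy for difflib from the Python standard library, and the helper name closest_name and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

def closest_name(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, or None if none reaches `cutoff`.

    Hypothetical helper; the 0.8 default is an arbitrary illustrative threshold.
    """
    # Compare case-insensitively, but return the candidate in its original spelling.
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# The Name columns from the question, as plain lists.
df1_names = ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford']
df2_names = ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford']

# One merge key per df2 name; None marks a name with no sufficiently close match.
keys = [closest_name(n, df1_names) for n in df2_names]
print(keys)
```

With pandas, the same helper could be applied as df2['key'] = df2.Name.apply(lambda x: closest_name(x, df1.Name.tolist())) before merging on key. Rows whose key is None will not match any row of df1, so an inner merge leaves them out rather than joining them to the wrong person.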