How to 'fuzzy' match strings when merging two dataframes in pandas

Question

I have two dataframes, df1 and df2.

import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})

I need to join these two dataframes on the Name column with pandas.merge. However, as you can see, the Name values differ slightly between the two dataframes. Let's assume they refer to the same people. If I simply do:

pd.merge(df1, df2, how='inner', on='Name')

I only get back a dataframe with a single row, 'Ian Ford'.

Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.

Recommended answer

I would use fuzzywuzzy here:

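from fuzzywuzzy import process

# For each name in df2, find the closest name in df1 and store it as the join key.
# process.extract returns a list of result tuples whose first element is the matched string.
df2['key'] = df2['Name'].apply(lambda x: process.extract(x, df1['Name'], limit=1)[0][0])

# Merge on the fuzzy-matched key instead of the raw Name column.
df2.merge(df1, left_on='key', right_on='Name')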
Out[1238]: 
        Name_x gender         key  Age      Name_y
0   adam Smith      M  Adam Smith   43  Adam Smith
1    Annie Kim      F    Anne Kim   21    Anne Kim
2  John  Weber      M  John Weber   55  John Weber
3     Ian Ford      M    Ian Ford   24    Ian Ford
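
The approach above always takes the single best match, however poor it is, so an unrelated name would still be paired with something. As a rough sketch of one way to guard against that, the snippet below normalizes case and whitespace first and then accepts a match only above a similarity cutoff. It swaps in the standard library's difflib in place of fuzzywuzzy so that it runs without extra installs. The helper names `normalize` and `best_match`, the `key_for` mapping, and the 0.8 cutoff are all assumptions made for illustration, not part of the original answer.

```python
import difflib
import re


def normalize(name):
    """Lower-case a name and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", name).strip().lower()


def best_match(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, or None if no candidate
    reaches `cutoff` (a difflib similarity ratio between 0 and 1).

    Hypothetical helper; the cutoff of 0.8 is an arbitrary starting point.
    """
    # Map each normalized candidate back to its original spelling.
    lookup = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize(name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# The Name values from df1 and df2 in the question.
names_df1 = ["Adam Smith", "Anne Kim", "John Weber", "Ian Ford"]
names_df2 = ["adam Smith", "Annie Kim", "John  Weber", "Ian Ford"]

# Build a df2-name -> df1-name mapping; unmatched names map to None.
key_for = {name: best_match(name, names_df1) for name in names_df2}
```

With pandas, the mapping could then be applied as `df2['key'] = df2['Name'].map(key_for)` before merging on `key` as in the answer. Names that fall below the cutoff get no key, so they should drop out of an inner merge instead of being joined to the wrong person.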
