加入两个数据框的模糊匹配 [英] Fuzzy matching to join two dataframe

查看:43
本文介绍了加入两个数据框的模糊匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有 2 个要合并的餐厅信息数据框.

I have 2 dataframes of restaurant information to merge.

df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
                     'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                     'Postal Code': [12345, 54321, 12345],
                    'Phone':[100,200,300]})

<小时>

d1:


d1:


df2:


df2:

  • 每家餐厅都有一个邮政编码(不是唯一的,可以有 2 家餐厅位于同一个地方).所以我不能合并基于邮政编码.
  • 但是可以通过以下方式区分共享相同邮政编码的餐馆他们的名字.
  • 餐厅名称的拼写略有不同,因此我也无法根据餐厅名称进行合并

理想情况下,我想生成一个如下所示的表格:

Ideally I want to produce a table that looks like this:

我尝试基于模糊匹配和邮政编码匹配来匹配餐厅名称,但无法获得非常准确的结果.我还尝试将餐厅名称与每个数据帧的邮政编码连接起来,并对连接的结果进行模糊匹配,但我认为这不是最好的方法.

I tried to match the restaurant names based on fuzzy matching followed by a match of postal code, but was not able to get a very accurate result. I also tried to concatenate the restaurant name with postal code for each of the dataframe and do a fuzzy matching of the concatenated result but I don't think this is the best way.

有没有办法在匹配两个数据帧时达到 100% 的准确率?

Is there any way to achieve 100% accuracy in matching the two dataframes?

推荐答案

检查 difflib.get_close_matches().

我使用您的示例数据框尝试了此操作.有用吗?

I tried this using your sample dataframe. Does it help?

import pandas as pd
import difflib

df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
                     'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                     'Postal Code': [12345, 54321, 12345],
                    'Phone':[100,200,300]})

df1['key'] = df1['Restaurant_Name']+df1['Postal Code'].astype(str)
df2['key'] = df2['Restaurant_Name']+df2['Postal Code'].astype(str)
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'])[0])

df1.merge(df2, on='key', how='outer')[['Restaurant_Name_x','Restaurant_Name_y','Postal Code_x','Phone']]

输出:

  Restaurant_Name_x Restaurant_Name_y  Postal Code_x  Phone
0             Apple             apple          12345  100.0
1            Banana            Banana          12345  300.0
2            Orange               NaN          54321    NaN
3             apple             apple          54321  200.0
4            apple1               NaN           1111    NaN

如您所说,我确实将餐厅名称与邮政编码连接起来以获得独特的组合.

As you said, I did concatenate the restaurant name with postal code to get a unique combination.

这篇关于加入两个数据框的模糊匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆