How to compare a value in one dataframe to a column in another using fuzzywuzzy ratio


Problem description

I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe, df, which holds hundreds of thousands of parsed address records. Both df_sample and df share the exact same structure:

zip_code     city        state     street_number    street_name   unit_number   country
 12345    FAKEVILLE     FLORIDA          123           FAKE ST        NaN          US

What I want to do is match a single row in df_sample against every row in df, starting with state, and keep only the rows where fuzzy.ratio(df['state'], df_sample['state']) > 0.9 in a new dataframe. Once this new, smaller dataframe is created from those matches, I would continue to do the same for city, zip_code, etc. Something like:

df_match = df[fuzzy.ratio(df_sample['state'], df['state']) > 0.9]

Except that doesn't work.
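
The line above cannot work as written because fuzz.ratio compares two strings and returns a single integer from 0 to 100. It therefore cannot produce a boolean mask over a whole column, and a cutoff of 0.9 is on the wrong scale. A minimal sketch of the element-wise version for one sample value follows. The data is made up, and the fallback to difflib is only an assumption so the snippet runs without fuzzywuzzy installed (its rounding may differ slightly from fuzzywuzzy's).

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer so the sketch runs without fuzzywuzzy.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Made-up data in the spirit of the question.
df = pd.DataFrame({'state': ['FLORIDA', 'FLORIDDA', 'NEVADA', 'TEXAS']})
sample_state = 'FLORIDA'  # one value taken from a single df_sample row

# Score every row of df against the single sample value.
scores = df['state'].apply(lambda s: ratio(sample_state, s))

# Scores are on a 0-100 scale, so a "0.9" cutoff becomes 90.
df_match = df[scores > 90]
```

Here df_match keeps the exact match and the near-miss spelling, and drops the other two states.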

My goal is to narrow down the number of matches each time I apply a stricter search criterion, and eventually end up with a dataframe containing as few matches as possible by narrowing it down one column at a time. But I am unsure how to do this for any single record.

Recommended answer

Create the dataframes and pair up every row

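import pandas as pd
from fuzzywuzzy import fuzz

# The constant 'key' column exists only so the merge below pairs every row
# of df_sample with every row of df (a cross join).
df = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                   'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})

df_sample = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                          'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})

# Columns present in both frames get the suffixes _x (df_sample) and _y (df).
merged_df = df_sample.merge(df, on='key')
# fuzz.ratio takes two strings and returns an integer score from 0 to 100.
merged_df['fuzzy_ratio'] = merged_df.apply(lambda row: fuzz.ratio(row['state_x'], row['state_y']), axis=1)
merged_df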

You get the fuzzy ratio for each pair:

    key  zip_x  state_x  zip_y  state_y  fuzzy_ratio
0     1      6  florida      1  Florida           86
1     1      6  florida      2   Nevada           31
2     1      6  florida      3    Texas           17
3     1      6  florida      4  Florida           86
4     1      6  florida      5    Texas           17
5     1      7     Flor      1  Florida           73
6     1      7     Flor      2   Nevada            0
7     1      7     Flor      3    Texas            0
8     1      7     Flor      4  Florida           73
9     1      7     Flor      5    Texas            0
10    1      8       NY      1  Florida            0
11    1      8       NY      2   Nevada           25
12    1      8       NY      3    Texas            0
13    1      8       NY      4  Florida            0
14    1      8       NY      5    Texas            0
15    1      9  Florida      1  Florida          100
16    1      9  Florida      2   Nevada           31
17    1      9  Florida      3    Texas           17
18    1      9  Florida      4  Florida          100
19    1      9  Florida      5    Texas           17
20    1     10       Tx      1  Florida            0
21    1     10       Tx      2   Nevada            0
22    1     10       Tx      3    Texas           57
23    1     10       Tx      4  Florida            0
24    1     10       Tx      5    Texas           57

Then filter out what you don't want:

mask = (merged_df['fuzzy_ratio']>80)
merged_df[mask]

Result:

    key  zip_x  state_x  zip_y  state_y  fuzzy_ratio
0     1      6  florida      1  Florida           86
3     1      6  florida      4  Florida           86
15    1      9  Florida      1  Florida          100
18    1      9  Florida      4  Florida          100
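
The question also asks about narrowing the matches one column at a time for a single sample row. One way this might be sketched is shown below. It is not part of the original answer: narrow_matches, its threshold default, and the sample data are all hypothetical, and the difflib fallback is only an assumption so the snippet runs without fuzzywuzzy installed.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer so the sketch runs without fuzzywuzzy.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def narrow_matches(df, sample_row, columns, threshold=80):
    """Hypothetical helper: filter df column by column, keeping only rows
    that score above `threshold` against the same column of sample_row."""
    matches = df
    for col in columns:
        target = str(sample_row[col])
        # Cast each value to str so numeric columns (such as zip codes)
        # can be scored; missing values are scored as empty strings.
        scores = matches[col].apply(
            lambda v: ratio(target, '' if pd.isna(v) else str(v)))
        matches = matches[scores > threshold]
        if matches.empty:
            break  # nothing left to narrow
    return matches


# Made-up records using column names from the question.
df = pd.DataFrame({
    'zip_code': [12345, 12345, 54321, 12346],
    'city': ['FAKEVILLE', 'FAKEVILE', 'FAKEVILLE', 'REALTOWN'],
    'state': ['FLORIDA', 'FLORIDA', 'TEXAS', 'FLORIDA'],
})
df_sample = pd.DataFrame({
    'zip_code': [12345],
    'city': ['FAKEVILLE'],
    'state': ['FLORIDA'],
})

# Narrow by state first, then city, then zip_code, for one sample row.
result = narrow_matches(df, df_sample.iloc[0], ['state', 'city', 'zip_code'])
```

Because each pass only scores the rows that survived the previous one, the later (stricter) columns are compared against a much smaller frame than the full df, and no merged frame of all pairs has to be held in memory.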
