How to compare a value in one dataframe to a column in another using fuzzywuzzy ratio
Problem description
I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe, df, with hundreds of thousands of parsed address records. Both df_sample and df share the exact same structure:
zip_code  city       state    street_number  street_name  unit_number  country
12345     FAKEVILLE  FLORIDA  123            FAKE ST      NaN          US
What I want to do is match a single row in df_sample against every row in df, starting with state, and take only the rows where fuzzy.ratio(df['state'], df_sample['state']) > 0.9 into a new dataframe. Once this new, smaller dataframe is created from those matches, I would continue to do this for city, zip_code, etc. Something like:
df_match = df[fuzzy.ratio(df_sample['state'], df['state']) > 0.9]
Except that doesn't work.
My goal is to narrow down the number of matches each time I apply a stricter search criterion, and eventually end up with a dataframe with as few matches as possible by narrowing it down one column at a time. But I am unsure how to do this for any single record.
Recommended answer
Create the dataframes:
import pandas as pd
from fuzzywuzzy import fuzz

# The constant 'key' column makes the merge pair every row of df_sample with every row of df.
df = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                   'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                          'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})
merged_df = df_sample.merge(df, on='key')

# fuzz.ratio compares two strings, so it is applied row by row.
merged_df['fuzzy_ratio'] = merged_df.apply(lambda row: fuzz.ratio(row['state_x'], row['state_y']), axis=1)
merged_df
You get the fuzzy ratio for each pair (an integer from 0 to 100):
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
1 1 6 florida 2 Nevada 31
2 1 6 florida 3 Texas 17
3 1 6 florida 4 Florida 86
4 1 6 florida 5 Texas 17
5 1 7 Flor 1 Florida 73
6 1 7 Flor 2 Nevada 0
7 1 7 Flor 3 Texas 0
8 1 7 Flor 4 Florida 73
9 1 7 Flor 5 Texas 0
10 1 8 NY 1 Florida 0
11 1 8 NY 2 Nevada 25
12 1 8 NY 3 Texas 0
13 1 8 NY 4 Florida 0
14 1 8 NY 5 Texas 0
15 1 9 Florida 1 Florida 100
16 1 9 Florida 2 Nevada 31
17 1 9 Florida 3 Texas 17
18 1 9 Florida 4 Florida 100
19 1 9 Florida 5 Texas 17
20 1 10 Tx 1 Florida 0
21 1 10 Tx 2 Nevada 0
22 1 10 Tx 3 Texas 57
23 1 10 Tx 4 Florida 0
24 1 10 Tx 5 Texas 57
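The constant key column in the merge above exists only to pair every sample row with every reference row. As a possible alternative, assuming pandas 1.2 or later, the same pairing can be requested directly with how='cross'. The suffixes '_sample' and '_ref' are illustrative names chosen here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})

# how='cross' pairs every row of df_sample with every row of df,
# so no dummy 'key' column is needed (requires pandas >= 1.2).
pairs = df_sample.merge(df, how='cross', suffixes=('_sample', '_ref'))
print(pairs.shape)  # (25, 4)
```

Either way the result has len(df_sample) * len(df) rows, so with hundreds of thousands of reference records it is worth keeping the sample side small.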
Then filter out what you don't want:
mask = (merged_df['fuzzy_ratio']>80)
merged_df[mask]
Result:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
3 1 6 florida 4 Florida 86
15 1 9 Florida 1 Florida 100
18 1 9 Florida 4 Florida 100
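To narrow the matches for a single record one column at a time, as the question describes, the same idea can be applied in a loop. The following is only a sketch under stated assumptions: the helper name narrow, the example rows, and the thresholds are hypothetical, and the difflib fallback is an approximation used only when fuzzywuzzy is not installed. Note that fuzz.ratio takes two strings (not two Series) and returns a score from 0 to 100, so a cutoff of 0.9 corresponds to 90.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    score = fuzz.ratio
except ImportError:
    # Fallback (assumption): difflib gives a comparable 0-100 similarity score.
    from difflib import SequenceMatcher

    def score(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def narrow(candidates, sample_row, column, threshold):
    """Keep candidate rows whose `column` scores >= threshold against sample_row."""
    target = str(sample_row[column])
    # astype(str) turns missing values into the string 'nan' so scoring never fails.
    scores = candidates[column].astype(str).apply(lambda v: score(target, v))
    return candidates[scores >= threshold]


# Hypothetical reference data and one hypothetical sample record.
df = pd.DataFrame({
    'zip_code': ['12345', '12345', '89501', '12354'],
    'city': ['FAKEVILLE', 'FAKEVILE', 'RENO', 'FAKEVILLE'],
    'state': ['FLORIDA', 'FLORIDA', 'NEVADA', 'FLORDA'],
})
sample_row = pd.Series({'zip_code': '12345', 'city': 'FAKEVILLE', 'state': 'FLORIDA'})

# Each pass scores only the rows that survived the previous column.
matches = df
for column, threshold in [('state', 90), ('city', 90), ('zip_code', 100)]:
    matches = narrow(matches, sample_row, column, threshold)
    if matches.empty:
        break
print(matches)
```

Because each step works on the already-reduced dataframe, the expensive string comparisons run on fewer rows with every column. To process all 10 sample records, the loop can be repeated for each row of df_sample, for example via df_sample.iterrows().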