在数据框列上应用模糊匹配,然后将结果保存到新列中 [英] Apply fuzzy matching across a dataframe column and save results in a new column
问题描述
我有两个数据帧,每个数据帧具有不同的行数.下面是每个数据集中的几行
I have two data frames with each having a different number of rows. Below is a couple rows from each data set
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
和
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
我使用combined_data = pandas.concat([df1, df2], axis = 1)
并排加入了他们.我的下一个目标是使用fuzzy wuzzy
模块中的几个不同的匹配命令,将df1['Company']
下的每个字符串与df2['FDA Company']
下的每个字符串进行比较,并返回最佳匹配的值及其名称.我想将其存储在新列中.例如,如果我对df1['Company']
中的df1['Company']
中的LACKY SHEET METAL
中的fuzz.ratio
和fuzz.token_sort_ratio
执行了操作,则返回的最佳匹配是LACKY SHEET METAL
,且得分为100
,然后将其保存在combined data
中的新列.结果看起来像
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1)
. My next goal is to compare each string under df1['Company']
to each string under in df2['FDA Company']
using several different matching commands from the fuzzy wuzzy
module and return the value of the best match and its name. I want to store that in a new column. For example if I did the fuzz.ratio
and fuzz.token_sort_ratio
on LACKY SHEET METAL
in df1['Company']
to df2['FDA Company']
it would return that the best match was LACKY SHEET METAL
with a score of 100
and this would then be saved under a new column in combined data
. It results would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
我尝试做
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
但是出现错误,因为列的长度不同.
But got an error because the lengths of the columns are different.
我很困惑.我怎么能做到这一点?
I am stumped. How I can accomplish this?
推荐答案
我不知道你在做什么.这就是我要做的.
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
创建一系列元组进行比较:
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
df2['FDA Company']]).to_series()
创建一个特殊函数来计算模糊度量并返回序列.
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
return pd.Series([fuzz.ratio(*tup),
fuzz.token_sort_ratio(*tup)],
['ratio', 'token'])
将metrics
应用于compare
系列
compare.apply(metrics)
有很多方法可以完成下一部分:
There are bunch of ways to do this next part:
获取与df1
compare.apply(metrics).unstack().idxmax().unstack(0)
获取与df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
这篇关于在数据框列上应用模糊匹配,然后将结果保存到新列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!