在数据框列上应用模糊匹配,然后将结果保存到新列中 [英] Apply fuzzy matching across a dataframe column and save results in a new column

查看:111
本文介绍了在数据框列上应用模糊匹配,然后将结果保存到新列中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据帧,每个数据帧具有不同的行数.下面是每个数据集中的几行

I have two data frames with each having a different number of rows. Below is a couple rows from each data set

df1 =
     Company                                   City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

df2 = 
     FDA Company                    FDA City    FDA State   FDA ZIP
     LACKEY SHEET METAL             St. Louis   MO          63102
     PRIMUS STERILIZER COMPANY LLC  Great Bend  KS          67530
     HELGET GAS PRODUCTS INC        Omaha       NE          68127
     ORTHOQUEST LLC                 La Vista    NE          68128

我使用combined_data = pandas.concat([df1, df2], axis = 1)并排加入了他们.我的下一个目标是使用fuzzy wuzzy模块中的几个不同的匹配命令,将df1['Company']下的每个字符串与df2['FDA Company']下的每个字符串进行比较,并返回最佳匹配的值及其名称.我想将其存储在新列中.例如,如果我对df1['Company']中的df1['Company']中的LACKY SHEET METAL中的fuzz.ratiofuzz.token_sort_ratio执行了操作,则返回的最佳匹配是LACKY SHEET METAL,且得分为100,然后将其保存在combined data中的新列.结果看起来像

I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under in df2['FDA Company'] using several different matching commands from the fuzzy wuzzy module and return the value of the best match and its name. I want to store that in a new column. For example if I did the fuzz.ratio and fuzz.token_sort_ratio on LACKY SHEET METAL in df1['Company'] to df2['FDA Company'] it would return that the best match was LACKY SHEET METAL with a score of 100 and this would then be saved under a new column in combined data. It results would look like

combined_data =
     Company                                   City         State  ZIP      FDA Company                     FDA City    FDA State   FDA ZIP     fuzzy.token_sort_ratio    match    fuzzy.ratio         match
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101    LACKEY SHEET METAL              St. Louis   MO          63102       LACKEY SHEET METAL        100      LACKEY SHEET METAL  100
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102    PRIMUS STERILIZER COMPANY LLC   Great Bend  KS          67530
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102    HELGET GAS PRODUCTS INC         Omaha       NE          68127
     LACKEY SHEET METAL                        St. Louis    MO     63102    ORTHOQUEST LLC                  La Vista    NE          68128

我尝试做

combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1) 

但是出现错误,因为列的长度不同.

But got an error because the lengths of the columns are different.

我很困惑.我怎么能做到这一点?

I am stumped. How I can accomplish this?

推荐答案

我不知道你在做什么.这就是我要做的.

I couldn't tell what you were doing. This is how I would do it.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

创建一系列元组进行比较:

Create a series of tuples to compare:

compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()

创建一个特殊函数来计算模糊度量并返回序列.

Create a special function to calculate fuzzy metrics and return a series.

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

metrics应用于compare系列

compare.apply(metrics)

有很多方法可以完成下一部分:

There are bunch of ways to do this next part:

获取与df1

compare.apply(metrics).unstack().idxmax().unstack(0)

获取与df2

compare.apply(metrics).unstack(0).idxmax().unstack(0)

这篇关于在数据框列上应用模糊匹配,然后将结果保存到新列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆