Map values by comparing 2 columns from different dataframes using partial match in Python


Problem description


I have two dataframes: df1, which contains id_number, and df2, which contains identity_No.

For each row of df1, I want to compare df1['id_number'] against the whole df2['identity_No'] column and, when a condition is met, map the values of the best-matching df2 row onto that df1 row. I have also tried partial matching, but it does not produce the expected output.

df1

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN2231D           AXN pvt Ltd        IN225                          Yes
45      UK654IN        Aviva Intl Ltd        IN115                          No
65      SL1432H   Ship Incorporations        CZ555                          Yes
35      LK0678G  Oppo Mobiles pvt ltd        PQ795                          Yes
59      NG5678J             Nokia Inc        RS885                          No
20      IN2231D           AXN pvt Ltd        IN215                          Yes

df2

OR_score   identity_No       comp_name        comp_code   
51          UK654IN        Aviva Int.L Ltd       IN515  
25          SL6752J       Ship Inc Traders       CZ555  
79          NG5678K             Nokia Inc        RS005 
20          IN22312           AXN pvt Ltd        IN255
38          LK0665G       Oppo Mobiles ltd       PQ895 

For example, df1['id_number'] in row 1 of df1 is compared against every row of df2['identity_No']. Its highest match percentage is with row 4 of df2, and because that is more than 80%, the corresponding values from row 4 of df2 are copied to row 1 of df1. The same applies to every row of df1.

Expected Output:

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN22312           AXN pvt Ltd        IN225              90          Yes
51      UK654IN       Aviva Int.L Ltd        IN115              100         No
25      SL1432H   Ship Incorporations        CZ555              30          Yes
38      LK0665G      Oppo Mobiles ltd        PQ795              80          Yes
79      NG5678K             Nokia Inc        RS885              85          No
20      IN22312           AXN pvt Ltd        IN225              90          Yes

I have tried this now:

for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if process.extractOne(row["id_number"], df["identity_No"])[1] >=80:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']

Attempt 2:

for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if fuzz.partial_ratio(row["id_number"], config2["identity_No"]) >=80:

Please suggest.

Solution

This builds on my previous answer.

Comments are embedded in the code:

import pandas as pd
from fuzzywuzzy import process

# Column mapping between the 2 dataframes
cols1 = ["score", "id_number", "company_name", "company_code"]
cols2 = ["OR_score", "identity_No", "comp_name", "comp_code"]

# For each id_number, find the single best match in df2["identity_No"].
# Each result is a (matched value, score, df2 index) tuple.
dfm = pd.DataFrame(df1["id_number"].apply(lambda x: process.extractOne(x, df2["identity_No"]))
                                   .tolist(), columns=["match_comp", "match_acc", "match_idx"])

# Get the indexes of (df1, df2) which satisfy the condition (match_acc> 80)
idx1, idx2 = dfm.loc[dfm["match_acc"] > 80, "match_idx"].reset_index().values.T.tolist()

# Update values from df2 to df1
df1.loc[idx1, cols1] = df2.loc[idx2, cols2].values
df1["match_acc"] = dfm["match_acc"]  # don't forget match_acc column

>>> df1
   score id_number          company_name company_code  match_acc action_reqd
0     20   IN22312           AXN pvt Ltd        IN255         86         Yes
1     51   UK654IN       Aviva Int.L Ltd        IN515        100          No
2     65   SL1432H   Ship Incorporations        CZ555         43         Yes
3     35   LK0678G  Oppo Mobiles pvt ltd        PQ795         71         Yes
4     79   NG5678K             Nokia Inc        RS005         86          No
5     20   IN22312           AXN pvt Ltd        IN255         86         Yes
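
The match_acc values above can be checked by hand. The following is a minimal sketch, assuming the basic similarity ratio is 2 * M / T (M = number of matched characters, T = combined length of both strings) scaled to 0-100, which is what Python's standard difflib computes. `simple_ratio` is a hypothetical helper, not part of fuzzywuzzy, and the default scorer used by `process.extractOne` may give different values on multi-word strings or when a different string-matching backend is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched chars / total chars."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Six of seven characters line up ("IN2231"), so 2 * 6 / 14 is about 0.857.
print(simple_ratio("IN2231D", "IN22312"))  # 86
# Only "LK06" and "G" line up, so 2 * 5 / 14 is about 0.714: below the cutoff.
print(simple_ratio("LK0678G", "LK0665G"))  # 71
```

This is why rows 0, 4 and 5 are overwritten from df2 (score 86), while rows 2 and 3 keep their original df1 values (scores 43 and 71).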

Tested on your input data:

import io

df1 = pd.read_csv(io.StringIO("""score,id_number,company_name,company_code,match_acc,action_reqd
20,IN2231D,AXN pvt Ltd,IN225,,Yes
45,UK654IN,Aviva Intl Ltd,IN115,,No
65,SL1432H,Ship Incorporations,CZ555,,Yes
35,LK0678G,Oppo Mobiles pvt ltd,PQ795,,Yes
59,NG5678J,Nokia Inc,RS885,,No
20,IN2231D,AXN pvt Ltd,IN215,,Yes"""))

df2 = pd.read_csv(io.StringIO("""OR_score,identity_No,comp_name,comp_code
51,UK654IN,Aviva Int.L Ltd,IN515
25,SL6752J,Ship Inc Traders,CZ555
79,NG5678K,Nokia Inc,RS005
20,IN22312,AXN pvt Ltd,IN255
38,LK0665G,Oppo Mobiles ltd,PQ895"""))
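
For readers who want to trace the logic row by row, the same idea (best match per row, then overwrite only above a threshold) can be sketched in plain Python. This is an illustration under stated assumptions, not the answer's implementation: difflib stands in for fuzzywuzzy's scorer, rows are plain dicts instead of dataframes, and `simple_ratio` and `map_best_matches` are hypothetical names.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """0-100 similarity from difflib (a stand-in for fuzzywuzzy's scorer)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def map_best_matches(rows1, rows2, threshold=80):
    """Return copies of rows1, overwritten from the best rows2 match
    whenever that match scores above the threshold."""
    # Same column pairing as cols1/cols2 in the answer.
    pairs = [("score", "OR_score"), ("id_number", "identity_No"),
             ("company_name", "comp_name"), ("company_code", "comp_code")]
    result = []
    for row in rows1:
        new_row = dict(row)  # leave the input rows untouched
        # Best candidate in rows2 for this id_number.
        best = max(rows2,
                   key=lambda r: simple_ratio(row["id_number"], r["identity_No"]))
        acc = simple_ratio(row["id_number"], best["identity_No"])
        if acc > threshold:
            for c1, c2 in pairs:
                new_row[c1] = best[c2]
        new_row["match_acc"] = acc  # recorded for every row, matched or not
        result.append(new_row)
    return result


rows1 = [
    {"score": 20, "id_number": "IN2231D", "company_name": "AXN pvt Ltd", "company_code": "IN225"},
    {"score": 35, "id_number": "LK0678G", "company_name": "Oppo Mobiles pvt ltd", "company_code": "PQ795"},
]
rows2 = [
    {"OR_score": 20, "identity_No": "IN22312", "comp_name": "AXN pvt Ltd", "comp_code": "IN255"},
    {"OR_score": 38, "identity_No": "LK0665G", "comp_name": "Oppo Mobiles ltd", "comp_code": "PQ895"},
]
for r in map_best_matches(rows1, rows2):
    print(r)
```

The first row is overwritten from df2 because its best score exceeds 80; the second keeps its original values and only gains a match_acc entry. The pandas version in the answer does the same work without an explicit Python loop over the update step.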
