How to apply a fuzzy matching function on the target and reference columns for pandas dataframes


Question

****** Edited with solution below ******

I have carefully read the guidelines and hope this question is acceptable.

I have two pandas dataframes. I need to apply a fuzzy matching function to the target and reference columns and merge the data based on the similarity score, preserving the original data.

I have checked similar questions, e.g. "Is it possible to do fuzzy match merge with python pandas?", but I was not able to use that solution. So far I have:

import distance
import pandas as pd

df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})

# Compare every reference name with every target name and keep the close pairs
fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        score = distance.jaccard(name1, name2)
        if score < 0.6:
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)

  1. How can I include 'NameId' and 'Type' (and any other columns) in the final output, e.g.:

df1_NameId  RefName         df1_Type    df1_NewColumn   Score   df2_NameId  TarName         df2_Type    df2_NewColumn
   1        robert johnes   Person           …          0.0000     1        roberto johnes  Person          …

  2. Is there a way to code this so that it scales easily and can be run on datasets with hundreds of thousands of rows?


    I have solved the original problem by unpacking the dataframes in the loop:

    import distance
    import pandas as pd

    # Create test dataframes
    df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
    df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})

    results = []
    # itertuples() yields (index, NameId, name) for each row. Unpack the fields
    # wanted in the final output and append one tuple per pair of rows.
    for idx1, id1, ref_name in df1.itertuples():
        for idx2, id2, tar_name in df2.itertuples():
            results.append((idx1, id1, ref_name, distance.jaccard(ref_name, tar_name), id2, idx2, tar_name))

    result_df = pd.DataFrame(results)
    print(result_df)
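For readers wondering why 'robert johnes' and 'roberto johnes' score 0.0: the scores shown in this thread are consistent with the strings being compared as sets of characters. The sketch below reproduces that calculation in plain Python. The helper name `char_jaccard` is hypothetical and is only an assumed stand-in for `distance.jaccard`, not the library's implementation.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard distance between the sets of characters in two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings: treat them as identical
    return 1 - len(set_a & set_b) / len(union)


# Both names use exactly the same ten characters, so the distance is 0.
print(char_jaccard("robert johnes", "roberto johnes"))

# 'lew malinosky' adds one character ('o') to the eleven in 'lew malinsky'.
print(round(char_jaccard("lew malinsky", "lew malinosky"), 6))
```

Because only the set of characters matters, this measure ignores character order and repetition, which is worth keeping in mind when choosing a threshold.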
    

    Recommended answer

    I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to each pair in that product gives the result you are after.

    df1["mergekey"] = 0
    df2["mergekey"] = 0
    df_merged = pd.merge(df1, df2, on = "mergekey")
    df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
    

    Result:

       NameId_x  RefName                  Type_x  mergekey  NameId_y  TarName                 Type_y  Distance
    0  1         robert johnes            Person  0         1         roberto johnes          Person  0.000000
    1  1         robert johnes            Person  0         2         lew malinosky           Person  0.705882
    2  1         robert johnes            Person  0         3         andreatta della blatta  Person  0.538462
    3  2         lew malinsky             Person  0         1         roberto johnes          Person  0.764706
    4  2         lew malinsky             Person  0         2         lew malinosky           Person  0.083333
    5  2         lew malinsky             Person  0         3         andreatta della blatta  Person  0.666667
    6  3         gioberto delle lanterne  Person  0         1         roberto johnes          Person  0.533333
    7  3         gioberto delle lanterne  Person  0         2         lew malinosky           Person  0.588235
    8  3         gioberto delle lanterne  Person  0         3         andreatta della blatta  Person  0.250000
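As a possible variation on the answer, here is a minimal sketch that builds the same Cartesian product with `how="cross"` (available in pandas 1.2 and later, so no helper key column is needed), applies the 0.6 threshold from the question, and keeps the closest target for each reference name. The `char_jaccard` helper, the `_df1`/`_df2` suffixes, and the names `pairs`, `close`, and `best` are assumptions made for illustration; `char_jaccard` is a hypothetical stand-in for `distance.jaccard` so the snippet runs without that package.

```python
import pandas as pd


def char_jaccard(a, b):
    # Hypothetical stand-in for distance.jaccard: Jaccard distance
    # between the sets of characters in the two strings.
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 1 - len(set_a & set_b) / len(union) if union else 0.0


df1 = pd.DataFrame({"NameId": [1, 2, 3], "Type": ["Person"] * 3,
                    "RefName": ["robert johnes", "lew malinsky",
                                "gioberto delle lanterne"]})
df2 = pd.DataFrame({"NameId": [1, 2, 3], "Type": ["Person"] * 3,
                    "TarName": ["roberto johnes", "lew malinosky",
                                "andreatta della blatta"]})

# Cartesian product of the two frames; overlapping column names get suffixes.
pairs = df1.merge(df2, how="cross", suffixes=("_df1", "_df2"))

# A list comprehension over the two columns avoids row-wise apply(axis=1).
pairs["Distance"] = [char_jaccard(ref, tar)
                     for ref, tar in zip(pairs["RefName"], pairs["TarName"])]

# Keep pairs under the question's threshold, then the closest target per name.
close = pairs[pairs["Distance"] < 0.6]
best = close.loc[close.groupby("RefName")["Distance"].idxmin()].reset_index(drop=True)
print(best[["NameId_df1", "RefName", "Distance", "NameId_df2", "TarName"]])
```

On the scalability question: a cross join produces one row per pair, so two frames of a few hundred thousand rows each would yield tens of billions of rows. For data of that size the full product is unlikely to fit in memory, and some way of restricting which pairs are compared would be needed before scoring.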
    
