How to apply a fuzzy matching function on the target and reference columns for pandas dataframes

Views: 75 · Posted: 2020/5/24 1:53:48 · Tags: python-3.x, pandas


Question


******Edited with Solution Below*******


I have carefully read the guidelines, hope the question is acceptable.


I have two pandas dataframes. I need to apply a fuzzy matching function on the target and reference columns and merge the data based on the similarity score, preserving the original data.


I have checked similar questions, e.g. "Is it possible to do fuzzy match merge with python pandas?", but I am not able to use that solution. So far I have:

import distance
import pandas as pd

df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person', 'Person', 'Person'], 'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person', 'Person', 'Person'], 'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        # Compute the distance once and reuse it for the filter and the output
        score = distance.jaccard(name1, name2)
        if score < 0.6:
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)


How can I include 'NameId' and 'Type' (and any other columns) in the final output, e.g.:

df1_NameId  RefName         df1_Type    df1_NewColumn   Score   df2_NameId  TarName         df2_Type    df2_NewColumn
   1        robert johnes   Person           …          0.0000     1        roberto johnes  Person          …


Is there a way to code this so that it is easily scalable and can be run on datasets with hundreds of thousands of rows?


I have solved the original problem by unpacking the dataframes in the loop:

import distance
import pandas as pd

# Create test dataframes
df1 = pd.DataFrame({'NameId': [1, 2, 3], 'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

results = []
# itertuples() yields one (Index, NameId, name) tuple per row, so each row
# can be unpacked directly in the loop header.
for idx1, id1, ref_name in df1.itertuples():
    for idx2, id2, tar_name in df2.itertuples():
        # Append every field wanted in the final output, plus the score
        results.append((idx1, id1, ref_name, distance.jaccard(ref_name, tar_name), id2, idx2, tar_name))

result_df = pd.DataFrame(results)
print(result_df)

Recommended answer


I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to each pair in that product gives the result you are after.

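# Give every row the same key so that the merge pairs each df1 row with each df2 row
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on="mergekey")
# Score each (RefName, TarName) pair row by row
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis=1)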

Result:

   NameId_x  RefName                  Type_x  mergekey  NameId_y  TarName                 Type_y  Distance
0  1         robert johnes            Person  0         1         roberto johnes          Person  0.000000
1  1         robert johnes            Person  0         2         lew malinosky           Person  0.705882
2  1         robert johnes            Person  0         3         andreatta della blatta  Person  0.538462
3  2         lew malinsky             Person  0         1         roberto johnes          Person  0.764706
4  2         lew malinsky             Person  0         2         lew malinosky           Person  0.083333
5  2         lew malinsky             Person  0         3         andreatta della blatta  Person  0.666667
6  3         gioberto delle lanterne  Person  0         1         roberto johnes          Person  0.533333
7  3         gioberto delle lanterne  Person  0         2         lew malinosky           Person  0.588235
8  3         gioberto delle lanterne  Person  0         3         andreatta della blatta  Person  0.250000
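
The Cartesian-product approach can also be combined with the question's 0.6 cutoff and its requested `df1_`/`df2_` column prefixes. The following is only a rough sketch under several assumptions: `char_jaccard` is a hypothetical stand-in that computes 1 - |A∩B| / |A∪B| over the character sets of the two strings (which reproduces the distances in the table above, but is not taken from the `distance` package itself); `fuzzy_cross_match` and its `chunk_size` parameter are illustrative names, not part of pandas; and `how="cross"` requires pandas 1.2 or later. Processing the reference frame in chunks limits how many pairs are held in memory at once, but the total work still grows with the product of the two row counts.

```python
import pandas as pd


def char_jaccard(a: str, b: str) -> float:
    """Jaccard distance between the character sets of two strings (assumed stand-in)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def fuzzy_cross_match(ref, tar, ref_col, tar_col, threshold=0.6, chunk_size=1000):
    """Cross-join ref and tar chunk by chunk, keeping pairs scoring below threshold."""
    # Prefix all columns so same-named columns stay distinguishable,
    # then restore the original names of the two columns being compared.
    ref = ref.add_prefix("df1_").rename(columns={f"df1_{ref_col}": ref_col})
    tar = tar.add_prefix("df2_").rename(columns={f"df2_{tar_col}": tar_col})
    pieces = []
    for start in range(0, len(ref), chunk_size):
        chunk = ref.iloc[start:start + chunk_size]
        pairs = chunk.merge(tar, how="cross")  # pandas >= 1.2
        pairs["Score"] = [
            char_jaccard(r, t) for r, t in zip(pairs[ref_col], pairs[tar_col])
        ]
        pieces.append(pairs[pairs["Score"] < threshold])
    if not pieces:
        return pd.DataFrame()
    return pd.concat(pieces, ignore_index=True)


df1 = pd.DataFrame({
    "NameId": [1, 2, 3],
    "Type": ["Person", "Person", "Person"],
    "RefName": ["robert johnes", "lew malinsky", "gioberto delle lanterne"],
})
df2 = pd.DataFrame({
    "NameId": [1, 2, 3],
    "Type": ["Person", "Person", "Person"],
    "TarName": ["roberto johnes", "lew malinosky", "andreatta della blatta"],
})

# A tiny chunk size is used here only to exercise the chunking path.
matches = fuzzy_cross_match(df1, df2, "RefName", "TarName", threshold=0.6, chunk_size=2)
print(matches)
```

With the sample data, the two pairs whose distance is above 0.6 in the table above are dropped and the remaining pairs keep all original columns from both frames. For very large inputs, a candidate-reduction step before scoring would likely still be needed; that is outside what this sketch covers.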
