How to link two dataframes based on the string similarity of one column

Question

I have two dataframes. Each has an ID column and a Name column that contains strings. They might look like this:

Dataframes:

DF-1                              DF-2
---------------------             ---------------------
     ID          Name                  ID          Name
1    56       aaeessa             1    12    H.P paRt 1 
2    98       1o7v9sM             2    76       aa3esza
3   175    HP. part 1             3   762    stakoverfl 
4     2     stackover             4     2       lo7v9Sm

I would like to compute the string similarity (e.g. Jaccard, Levenshtein) between each element of one dataframe and every element of the other, and select the pair with the highest score. I then want to match the two IDs so that I can join the complete dataframes later. The resulting table should look like this:

Result
-----------------
     ID1     ID2
1    56       76
2    98        2
3   175       12
4     2      762

This could easily be achieved with a double for loop, but I'm looking for a more elegant (and faster) way to accomplish it: maybe lambdas, a list comprehension, or some pandas tool. Perhaps some combination of groupby and idxmax on the similarity score would work, but I can't quite come up with the solution myself.

Edit: The dataframes are of different lengths. One purpose of this function is to determine which elements of the smaller dataframe appear in the larger one and match those, discarding the rest. The resulting table should therefore contain only matched ID pairs, or ID1 - NaN pairs (assuming DF-1 has more rows than DF-2).
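For reference, the brute-force idea described in the question can be sketched with the standard library alone. This is only an illustration under stated assumptions: it scores pairs with difflib's SequenceMatcher ratio rather than Jaccard or Levenshtein, it works on plain (ID, Name) tuples instead of dataframes, and the names similarity, best_matches and the 0.6 threshold are hypothetical choices, not part of the original post.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_matches(left, right, threshold=0.6):
    """For each (id, name) in left, find the most similar (id, name) in right.

    Returns (left_id, right_id) pairs. right_id is None when no candidate
    in right reaches the (assumed) threshold.
    """
    pairs = []
    for left_id, left_name in left:
        # Score this name against every candidate and keep the best one.
        score, right_id = max(
            ((similarity(left_name, name), rid) for rid, name in right),
            key=lambda scored: scored[0],
            default=(0.0, None),
        )
        pairs.append((left_id, right_id if score >= threshold else None))
    return pairs


# Rows copied from the DF-1 and DF-2 tables in the question.
df1_rows = [(56, "aaeessa"), (98, "1o7v9sM"), (175, "HP. part 1"), (2, "stackover")]
df2_rows = [(12, "H.P paRt 1"), (76, "aa3esza"), (762, "stakoverfl "), (2, "lo7v9Sm")]

print(best_matches(df1_rows, df2_rows))
```

With real dataframes the tuples could come from zip(df1['ID'], df1['Name']). The sketch still performs one comparison per pair of rows, so it is a baseline for checking results rather than the faster approach the question asks for.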

Recommended answer

Use the pandas-dedupe package: https://pypi.org/project/pandas-dedupe/

You need to train the classifier with human input; it then uses the learned settings to match the whole dataframe.

First run pip install pandas-dedupe, then try the following:

import pandas as pd
import pandas_dedupe

df1 = pd.DataFrame({'ID': [56, 98, 175],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1']})

df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})

# Start matching on the Name column (this step asks for human labelling input)
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# Reset the index
df_final = df_final.reset_index(drop=True)

# Print the result
print(df_final)

    ID        Name  cluster id  confidence
0   98     1o7v9sm         0.0    1.000000
1    2     lo7v9sm         0.0    1.000000
2  175  hp. part 1         1.0    0.999999
3   12  h.p part 1         1.0    0.999999
4   56     aaeessa         2.0    0.999967
5   76     aa3esza         2.0    0.999967
6  762  stakoverfl         NaN         NaN

You can see that matched pairs are assigned a cluster id and a confidence level, while unmatched rows get NaN in both columns. You can now analyse this information however you wish, for example by keeping only the results with a confidence above 80%.
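As a rough illustration of that last step, the sketch below groups IDs by cluster and drops rows at or below a confidence cut-off. It assumes the rows have been exported as a list of dicts (for example with df_final.to_dict('records')); the function name clusters_above and the 0.8 default are hypothetical, and the sample rows are copied from the printed output above.

```python
import math
from collections import defaultdict


def clusters_above(records, min_confidence=0.8):
    """Group IDs by cluster id, keeping rows whose confidence exceeds the cut-off."""
    groups = defaultdict(list)
    for row in records:
        cluster, confidence = row["cluster id"], row["confidence"]
        # Unmatched rows carry NaN in both columns, so skip them.
        if math.isnan(cluster) or math.isnan(confidence):
            continue
        if confidence > min_confidence:
            groups[int(cluster)].append(row["ID"])
    return dict(groups)


# Rows copied from the printed df_final above.
records = [
    {"ID": 98, "Name": "1o7v9sm", "cluster id": 0.0, "confidence": 1.0},
    {"ID": 2, "Name": "lo7v9sm", "cluster id": 0.0, "confidence": 1.0},
    {"ID": 175, "Name": "hp. part 1", "cluster id": 1.0, "confidence": 0.999999},
    {"ID": 12, "Name": "h.p part 1", "cluster id": 1.0, "confidence": 0.999999},
    {"ID": 56, "Name": "aaeessa", "cluster id": 2.0, "confidence": 0.999967},
    {"ID": 76, "Name": "aa3esza", "cluster id": 2.0, "confidence": 0.999967},
    {"ID": 762, "Name": "stakoverfl", "cluster id": float("nan"), "confidence": float("nan")},
]

print(clusters_above(records))
```

Working directly on the dataframe, the same filter can be written as a boolean mask, df_final[df_final['confidence'] > 0.8]. Note that the output shown above does not record which source dataframe each row came from, so turning clusters into ID1/ID2 columns needs that information from elsewhere.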
