How to merge two pandas DataFrames based on a similarity function?


Problem description

Given dataset 1:

name,x,y
st. peter,1,2
big university portland,3,4

and dataset 2:

name,x,y
saint peter,3,4
uni portland,5,6

The goal is to merge them by

d1.merge(d2, on="name", how="left")

There are no exact matches on name, though, so I'm looking for some kind of fuzzy matching. The specific technique doesn't matter much here; the question is more about how to incorporate it efficiently into pandas.

For example, st. peter might match saint peter in the other dataset, but big university portland might deviate too much for us to match it with uni portland.

One way to think of it is to join on the lowest Levenshtein distance, but only if that distance is below 5 edits (st. --> saint is 4).
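To make the edit count concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function name `levenshtein` is only illustrative and does not come from any library; it is a textbook dynamic-programming version, not a tuned implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("st.", "saint"))
print(levenshtein("big university portland", "uni portland"))
```

With a cutoff of "below 5 edits", the first pair would be accepted and the second, which is much further apart, would be rejected.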

The resulting DataFrame should contain only the st. peter row, with both name variations and both sets of x and y values.

Is there a way to do this kind of merge using pandas?

Recommended answer

Have you looked at fuzzywuzzy?

You might do something like:

import pandas as pd
import fuzzywuzzy.process as fwp

# df1 and df2 are the two DataFrames from the question (d1 and d2 there)
choices = list(df2['name'])

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # use row['name']: inside apply(axis=1), row.name is the row's index label
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=('_df1', '_df2'),
                  how='outer')  # assuming you want to keep unmatched records

Caveat emptor: I haven't tried to run this.
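If you only want rows whose best match clears a threshold, as the question asks, one possible sketch is a cross join followed by a score filter. Some assumptions here: it needs pandas 1.2 or later for how="cross", it uses SequenceMatcher.ratio() from the standard-library difflib as a stand-in similarity function (not Levenshtein and not fuzzywuzzy), and the helper names fuzzy_merge and similarity as well as the 0.75 cutoff are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})


def similarity(a: str, b: str) -> float:
    """Stand-in scorer: difflib ratio in [0, 1], higher means more similar."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_merge(left, right, on, score, threshold):
    """Keep, for each left row, its best-scoring right row if score >= threshold.

    Assumes the values in left[on] are unique, since they are used as group keys.
    """
    # Every left row paired with every right row: O(len(left) * len(right)).
    pairs = left.merge(right, how="cross", suffixes=("_d1", "_d2"))
    pairs["score"] = [
        score(a, b) for a, b in zip(pairs[f"{on}_d1"], pairs[f"{on}_d2"])
    ]
    # Index label of the highest-scoring pair for each left name.
    best = pairs.loc[pairs.groupby(f"{on}_d1")["score"].idxmax()]
    return best[best["score"] >= threshold].reset_index(drop=True)


result = fuzzy_merge(d1, d2, on="name", score=similarity, threshold=0.75)
print(result)
```

On the sample data this should keep only the st. peter / saint peter pair, with both names and both sets of x and y. Because the cross join grows with the product of the two row counts, it is probably only reasonable for small or pre-filtered frames; any other scorer can be passed in place of similarity as long as the threshold is adjusted to its scale.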
