pandas 模糊合并/匹配名称列，包含重复项 [英] Pandas fuzzy merge/match name column, with duplicates

查看：78 发布时间：2020/5/23 22:56:51 python pandas dataframe fuzzywuzzy fuzzy-comparison

本文介绍了 pandas 模糊合并/匹配名称列，包含重复项的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我目前有两个数据帧，一个用于donors，一个用于fundraisers.我正在尝试查找是否有任何fundraisers也进行了捐赠，如果有的话，请将其中的一些信息复制到我的fundraiser数据集中(捐赠者的姓名，电子邮件及其首次捐赠).我的数据存在以下问题:

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraiser dataset (donor name, email and their first donation). Problems with my data are:

我需要通过姓名和电子邮件进行匹配，但是用户的姓名可能略有不同(例如，"Kat"和"Kathy").
donors和fundraisers的重复名称:
- 2a)与捐赠者一起，我可以得到唯一的姓名/电子邮件组合，因为我只在乎首次捐赠日期
- 2b)尽管有筹款人，但我需要保留两行，并且不要丢失日期等数据.

I need to match by name and email, but a user might have slightly different names (ex 'Kat' and 'Kathy').
Duplicate names for donors and fundraisers:
- 2a) With donors I can get unique name/email combinations since I just care about the first donation date
- 2b) With fundraisers though I need to keep both rows and not lose data like the date.

我现在有示例代码:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})

donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)

donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

因此，这给了我每位捐赠者的第一笔捐赠(假设姓名和电子邮件完全相同的任何人都是同一个人).

So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).

理想情况下，我希望我的fundraisers数据集看起来像这样:

Ideally I want my fundraisers dataset to look like:

Date                Email       name        Donor Name  Donor Email Donor Date
2013-03-27 10:00:00     a@a.ca      John Doe    John Doe    a@a.ca      2013-03-27 10:00:00 
2013-01-03 10:39:00     a@a.ca      John Doe    John Doe    a@a.ca      2013-03-27 10:00:00 
2013-02-03 10:39:00     d@d.ca      Kathy test  Kat test    d@d.ca      2013-03-27 10:39:00 
2013-03-03 10:39:00     asdf@asdf.ca    Tes Ester   
2013-04-03 10:39:00     something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-04-03 10:39:00

我尝试了以下线程:

I tried following this thread: is it possible to do fuzzy match merge with python pandas? but keep getting index out of range errors (guessing it doesn't like the duplicated names in fundraisers) :( So any ideas how I can match/merge these datasets?

使用for循环(它可以工作，但是速度非常慢，我觉得必须有更好的方法)

doing it with for loops (which works but is super slow and I feel there has to be a better way)

代码:

fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    max = 75
    for funindex in range(len(fundraisers.index)):
        aname = donors["name"][donindex]
        comp = fundraisers["name"][funindex]
        ratio = fuzz.ratio(aname, comp)
        if ratio > max:
            if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
                ratio *= 2
            max = ratio
            fundraisers["donor name"][funindex] = aname
            fundraisers["donor email"][funindex] = donors["Email"][donindex]
            fundraisers["donor date"][funindex] = donors["Date"][donindex]

推荐答案

(在我看来)这里有一些pythonic，可以(在您的示例中)运行代码，而没有显式循环:

Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:

def get_donors(row):
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        v = donors.ix[d.idxmax(), ['name','Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)

输出:

                 Date           Email        name donor name     donor email           donor date
0 2013-03-27 10:00:00          a@a.ca    John Doe   John Doe          a@a.ca  2013-03-01 10:39:00
1 2013-03-01 10:39:00          a@a.ca    John Doe   John Doe          a@a.ca  2013-03-01 10:39:00
2 2013-03-02 10:39:00          d@d.ca  Kathy test   Kat test          d@d.ca  2013-03-27 10:39:00
3 2013-03-03 10:39:00    asdf@asdf.ca   Tes Ester                                                
4 2013-03-04 10:39:00  something@a.ca    Jane Doe   Jane Doe  something@a.ca  2013-03-04 10:39:00

这篇关于 pandas 模糊合并/匹配名称列，包含重复项的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

pandas 模糊合并/匹配名称列，包含重复项 [英] Pandas fuzzy merge/match name column, with duplicates

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

pandas 模糊合并/匹配名称列，包含重复项 [英] Pandas fuzzy merge/match name column, with duplicates

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭