Pandas fuzzy merge/match name column, with duplicates

Problem description

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraiser dataset (donor name, email and their first donation). Problems with my data are:

  1. I need to match by name and email, but a user's name may differ slightly between the two datasets (e.g. 'Kat' and 'Kathy').
  2. Duplicate names for donors and fundraisers:
    • 2a) With donors I can reduce the data to unique name/email combinations, since I only care about the first donation date.
    • 2b) With fundraisers, though, I need to keep every duplicate row so that data such as the date is not lost.
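
To make problem 1 concrete, here is a minimal sketch of scoring how close two names are, using only the standard library's difflib (already imported in the question's code). The helper name_similarity and the 0.8 cutoff are assumptions for illustration, not part of the original question; a real cutoff would need tuning on actual data.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: pairs scoring at or above this are treated as the same person.
THRESHOLD = 0.8


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# 'Kat test' vs 'Kathy test' should score high; unrelated names should score low.
for left, right in [("Kat test", "Kathy test"), ("Kat test", "Tes Ester")]:
    score = name_similarity(left, right)
    print(left, "|", right, "|", round(score, 2), "| match:", score >= THRESHOLD)
```

fuzzywuzzy's fuzz.ratio returns a comparable score on a 0 to 100 scale, so the same idea carries over if that library is preferred.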

I have sample code right now:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Tom Smith", "Jane Doe", "Jane Doe", "Kat test"],
    "Email": ["a@a.ca", "a@a.ca", "b@b.ca", "c@c.ca", "something@a.ca", "d@d.ca"],
    "Date": ["27/03/2013  10:00:00 AM", "1/03/2013  10:39:00 AM", "2/03/2013  10:39:00 AM",
             "3/03/2013  10:39:00 AM", "4/03/2013  10:39:00 AM", "27/03/2013  10:39:00 AM"],
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Kathy test", "Tes Ester", "Jane Doe"],
    "Email": ["a@a.ca", "a@a.ca", "d@d.ca", "asdf@asdf.ca", "something@a.ca"],
    "Date": ["2/03/2013  10:39:00 AM", "27/03/2013  11:39:00 AM", "3/03/2013  10:39:00 AM",
             "4/03/2013  10:40:00 AM", "27/03/2013  10:39:00 AM"],
})

# Parse the day-first date strings into datetimes (each frame from its own column).
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(fundraisers["Date"], dayfirst=True)

# Build a name+email key, then keep only each donor's earliest donation.
donors["code"] = donors["name"].astype(str) + " " + donors["Email"].astype(str)
idx = donors.groupby("code")["Date"].transform("min") == donors["Date"]
donors = donors[idx].reset_index(drop=True)

So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
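
As a cross-check on that step, the same "first donation per donor" reduction can be sketched with sort_values plus drop_duplicates. The small donors_demo frame below is hypothetical data with dates already parsed, not the question's full dataset. One difference worth noting: this variant always keeps exactly one row per name/email pair, whereas the transform comparison keeps every row tied for the earliest date.

```python
import pandas as pd

# Hypothetical miniature of the donors frame, with dates already parsed.
donors_demo = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Jane Doe"],
    "Email": ["a@a.ca", "a@a.ca", "c@c.ca"],
    "Date": pd.to_datetime(["2013-03-27 10:00", "2013-03-01 10:39", "2013-03-03 10:39"]),
})

# Sort oldest first, then keep the first row seen for each name/email pair.
first_donations = (
    donors_demo.sort_values("Date")
    .drop_duplicates(subset=["name", "Email"], keep="first")
    .reset_index(drop=True)
)
print(first_donations)
```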

Ideally I want my fundraisers dataset to look like:

Date                 Email           name        Donor Name  Donor Email     Donor Date
2013-03-27 10:00:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-01-03 10:39:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-02-03 10:39:00  d@d.ca          Kathy test  Kat test    d@d.ca          2013-03-27 10:39:00
2013-03-03 10:39:00  asdf@asdf.ca    Tes Ester
2013-04-03 10:39:00  something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-04-03 10:39:00
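
One possible way to reach a table of that shape, sketched under assumptions rather than offered as a definitive solution: left-merge the fundraisers onto the de-duplicated donors by exact email, then blank out donor columns wherever the names are not similar enough. The demo frames, the similar helper, and the 0.8 cutoff are all hypothetical. The sketch also assumes each donor email appears only once; if several donors shared an email, the merge would multiply fundraiser rows.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical miniature inputs; donor rows are already reduced to first donations.
fundraisers_demo = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Kathy test", "Tes Ester"],
    "Email": ["a@a.ca", "a@a.ca", "d@d.ca", "asdf@asdf.ca"],
})
first_donors_demo = pd.DataFrame({
    "Donor Name": ["John Doe", "Kat test"],
    "Donor Email": ["a@a.ca", "d@d.ca"],
    "Donor Date": pd.to_datetime(["2013-03-01 10:39", "2013-03-27 10:39"]),
})


def similar(a, b, cutoff=0.8):
    """Assumed rule: names match when their difflib ratio reaches the cutoff."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() >= cutoff


# A left merge keeps every fundraiser row, including duplicates (point 2b).
merged = fundraisers_demo.merge(
    first_donors_demo, how="left", left_on="Email", right_on="Donor Email"
)

# Keep donor details only where an email matched and the names are close.
is_match = merged.apply(
    lambda r: bool(pd.notna(r["Donor Name"]) and similar(r["name"], r["Donor Name"])),
    axis=1,
).astype(bool)
for col in ["Donor Name", "Donor Email", "Donor Date"]:
    merged[col] = merged[col].where(is_match)

print(merged)
```

Matching on email first keeps the fuzzy comparison cheap, since names are only compared within rows that already share an address.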
