pandas 模糊合并/匹配名称列,包含重复项 [英] Pandas fuzzy merge/match name column, with duplicates
问题描述
我目前有两个数据帧,一个用于donors
,一个用于fundraisers
.我正在尝试查找是否有任何fundraisers
也进行了捐赠,如果有的话,请将其中的一些信息复制到我的fundraiser
数据集中(捐赠者的姓名,电子邮件及其首次捐赠).我的数据存在以下问题:
I have two dataframes currently, one for donors
and one for fundraisers
. I'm trying to find if any fundraisers
also gave donations, and if so, copy some of that information into my fundraiser
dataset (donor name, email and their first donation). Problems with my data are:
- 我需要通过姓名和电子邮件进行匹配,但是用户的姓名可能略有不同(例如,"Kat"和"Kathy").
-
donors
和fundraisers
的重复名称:- 2a)与捐赠者一起,我可以得到唯一的姓名/电子邮件组合,因为我只在乎首次捐赠日期
- 2b)尽管有筹款人,但我需要保留两行,并且不要丢失日期等数据.
- I need to match by name and email, but a user might have slightly different names (ex 'Kat' and 'Kathy').
- Duplicate names for
donors
andfundraisers
:- 2a) With donors I can get unique name/email combinations since I just care about the first donation date
- 2b) With fundraisers though I need to keep both rows and not lose data like the date.
我现在有示例代码:
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),"Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),"Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)
因此,这给了我每位捐赠者的第一笔捐赠(假设姓名和电子邮件完全相同的任何人都是同一个人).
So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
理想情况下,我希望我的fundraisers
数据集看起来像这样:
Ideally I want my fundraisers
dataset to look like:
Date Email name Donor Name Donor Email Donor Date
2013-03-27 10:00:00 a@a.ca John Doe John Doe a@a.ca 2013-03-27 10:00:00
2013-01-03 10:39:00 a@a.ca John Doe John Doe a@a.ca 2013-03-27 10:00:00
2013-02-03 10:39:00 d@d.ca Kathy test Kat test d@d.ca 2013-03-27 10:39:00
2013-03-03 10:39:00 asdf@asdf.ca Tes Ester
2013-04-03 10:39:00 something@a.ca Jane Doe Jane Doe something@a.ca 2013-04-03 10:39:00
-
I tried following this thread: is it possible to do fuzzy match merge with python pandas? but keep getting index out of range errors (guessing it doesn't like the duplicated names in fundraisers) :( So any ideas how I can match/merge these datasets?
使用for循环(它可以工作,但是速度非常慢,我觉得必须有更好的方法)
doing it with for loops (which works but is super slow and I feel there has to be a better way)
代码:
fundraisers["donor name"] = "" fundraisers["donor email"] = "" fundraisers["donor date"] = "" for donindex in range(len(donors.index)): max = 75 for funindex in range(len(fundraisers.index)): aname = donors["name"][donindex] comp = fundraisers["name"][funindex] ratio = fuzz.ratio(aname, comp) if ratio > max: if (donors["Email"][donindex] == fundraisers["Email"][funindex]): ratio *= 2 max = ratio fundraisers["donor name"][funindex] = aname fundraisers["donor email"][funindex] = donors["Email"][donindex] fundraisers["donor date"][funindex] = donors["Date"][donindex]
推荐答案
(在我看来)这里有一些pythonic,可以(在您的示例中)运行代码,而没有显式循环:
Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:
def get_donors(row): d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1) d = d[d >= 75] if len(d) == 0: v = ['']*3 else: v = donors.ix[d.idxmax(), ['name','Email','Date']].values return pd.Series(v, index=['donor name', 'donor email', 'donor date']) pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)
输出:
Date Email name donor name donor email donor date 0 2013-03-27 10:00:00 a@a.ca John Doe John Doe a@a.ca 2013-03-01 10:39:00 1 2013-03-01 10:39:00 a@a.ca John Doe John Doe a@a.ca 2013-03-01 10:39:00 2 2013-03-02 10:39:00 d@d.ca Kathy test Kat test d@d.ca 2013-03-27 10:39:00 3 2013-03-03 10:39:00 asdf@asdf.ca Tes Ester 4 2013-03-04 10:39:00 something@a.ca Jane Doe Jane Doe something@a.ca 2013-03-04 10:39:00
这篇关于 pandas 模糊合并/匹配名称列,包含重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!