looping over a dataframe and fetching related data from another dataframe : PANDAS


Problem description

I have a dataframe containing all of my customers' transaction data, with columns mailid, txn_date, and city. I need to consider customers active from 01-Jan-2016 onward, and for each of those mailids fetch their transactions from the base file, restrict them to their last 12 months of data (txn_date between the customer's last txn date and 365 days before it), and then find the city in which they transacted the most.

Sample base dataframe

#df
mailid  txn_date   city
satya   2015-07-21  a
satya   2015-08-11  b
satya   2016-05-11  c
xyz     2016-06-01  f
satya   2016-06-01  a
satya   2016-06-01  b
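So the snippets below can be run end to end, here is a minimal sketch reconstructing the sample base dataframe shown above (column names standardised to mailid/txn_date, which the prose and code use):

```python
import pandas as pd

# Sample base dataframe from the question: one row per transaction.
df = pd.DataFrame(
    {
        "mailid": ["satya", "satya", "satya", "xyz", "satya", "satya"],
        "txn_date": ["2015-07-21", "2015-08-11", "2016-05-11",
                     "2016-06-01", "2016-06-01", "2016-06-01"],
        "city": ["a", "b", "c", "f", "a", "b"],
    }
)
# Parse the dates up front; string comparison happens to work for
# ISO-formatted dates, but real datetimes are safer and faster.
df["txn_date"] = pd.to_datetime(df["txn_date"])
print(df)
```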

As I need customers from 2016-01-01 onward, I did

d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']

Now, for each mailid in d, I have to fetch its last 12 months of transaction data from the base dataframe df and work out its most-transacted city. For that I am using a for loop like

from datetime import timedelta

x = d.groupby(['mailid'])['txn_date'].max().reset_index()  #### last txn date per customer, to anchor the 12-month window
x['max_city'] = 'N'  ## giving default value as 'N'
for idx, row in x.iterrows():
    g = row['txn_date'].date()           # assumes txn_date is already datetime
    h = g - timedelta(days=365)          ### getting the 12-month-back date
    y = df[(df['mailid'] == row['mailid']) & (df['txn_date'] >= str(h))]
    y = y.sort_values('txn_date')        ### sorting bcoz I want the last txn to win when counts for two cities tie
    c = y.groupby(['mailid', 'city']).size().reset_index()
    v = c.groupby(['mailid'])[0].max().reset_index()
    dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
    dcb = pd.merge(dca, v, on=['mailid', 0])
    m = dcb.drop_duplicates(['mailid'], keep='last')  # keep='last' replaces the removed take_last=True
    x.loc[idx, 'max_city'] = m['city'].unique()[0]    # assigning via row[2] only changes the iterrows copy, not x

o/p:

mailid  max_city
satya   b   ### in the last 12 months (2015-06-01 to 2016-06-01), a=2 and b=2; b is the later txn, so b is taken as the max city
xyz     f

Though my code works for small chunks of data (I am sure it is unorganised and uses no proper naming convention, as I was practicing), the loop hits the main base dataframe df once for every customer present in dataframe x.

So my main concern is: if my df has 100 million rows and x has 6 million rows, then the for loop will execute 6 million times, hitting df each time to fetch the matching mailid's data and do the work of finding its max transacted city.

If it calculates 3 mailids' max city in 1 minute, then 6 million will take 2 million minutes... which will be a serious problem...

So I need suggestions from you guys on how to optimize this scenario: hitting the main base fewer times and applying some more convenient pandas way to do it (which I am not able to do yet)...
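For comparison, the whole loop above can be collapsed into a few vectorised passes. This is a sketch, not from the original post, and it assumes the sample dataframe shape described earlier (mailid, txn_date as datetime, city); note that count ties with an identical latest txn date are broken arbitrarily here, whereas the loop uses row order:

```python
import pandas as pd

# Hypothetical sample data in the shape described above.
df = pd.DataFrame(
    {
        "mailid": ["satya", "satya", "satya", "xyz", "satya", "satya"],
        "txn_date": pd.to_datetime(
            ["2015-07-21", "2015-08-11", "2016-05-11",
             "2016-06-01", "2016-06-01", "2016-06-01"]
        ),
        "city": ["a", "b", "c", "f", "a", "b"],
    }
)

# 1. Per-customer window start computed in one vectorised pass,
#    instead of one lookup into df per customer.
last = df.groupby("mailid")["txn_date"].transform("max")
win = df[df["txn_date"] >= last - pd.Timedelta(days=365)]

# 2. Txn count and latest txn date per (mailid, city) pair.
stats = (
    win.groupby(["mailid", "city"])["txn_date"]
       .agg(n="count", last_txn="max")
       .reset_index()
)

# 3. Per mailid, keep the row with the highest count,
#    breaking count ties by the latest txn date.
best = (
    stats.sort_values(["n", "last_txn"])
         .drop_duplicates("mailid", keep="last")
         .set_index("mailid")["city"]
)
print(best)
```

Restricting to customers active from 2016-01-01 can then be done once, by filtering `best` to the mailids in d, rather than inside a loop.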

Please, suggest!!!! Thanks in adv.

Recommended answer

Here's an alternative that might be a little easier to read:

def f(g):
    dc = g.groupby('city')['txn_date']
    # count txns per city and take each city's latest txn date,
    # then sort newest-first so count ties favour the most recent city
    dc_sorted = dc.agg(['count', 'max']).sort_values('max', ascending=False)
    return dc_sorted['count'].idxmax()  # idxmax returns the city label (argmax now returns a position)

df.groupby('mailid').apply(f)

但是,我怀疑 apply 在非常大的表上会有一些性能问题.

However, I suspect that apply will have some performance issues on very large tables.
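As a quick sanity check, here is the answer's approach run end to end on the sample data (a sketch, assuming standardised mailid/txn_date column names; note that f itself does not enforce the 12-month window, which happens to cover all of the sample rows anyway):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "mailid": ["satya", "satya", "satya", "xyz", "satya", "satya"],
        "txn_date": pd.to_datetime(
            ["2015-07-21", "2015-08-11", "2016-05-11",
             "2016-06-01", "2016-06-01", "2016-06-01"]
        ),
        "city": ["a", "b", "c", "f", "a", "b"],
    }
)

def f(g):
    # per city: txn count and latest txn date, sorted newest-first so
    # idxmax breaks count ties in favour of the most recently used city
    dc_sorted = (
        g.groupby("city")["txn_date"]
         .agg(["count", "max"])
         .sort_values("max", ascending=False)
    )
    return dc_sorted["count"].idxmax()

max_city = df.groupby("mailid").apply(f)
print(max_city)
```

For satya, cities a and b tie at 2 txns with the same latest date, so either may be returned; for xyz the only city is f.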
