Count Number of Rows GroupBy within a GroupBy Between Two Dates in Pandas Dataframe


Problem description

I have a dataframe df, which can be created with the following code:

import random
from datetime import timedelta
import datetime
import pandas as pd

# create a test range of dates
rng = pd.date_range(datetime.date(2015, 7, 15), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)
# create the test dataframe
# note: randint is inclusive on both ends, so the upper bound must be
# len(rng) - 1 to avoid an occasional IndexError
d = {'jid': [i for i in range(100, 121)],
     'cid': [random.randint(1, 2) for _ in testpts],
     'ctid': [random.randint(3, 4) for _ in testpts],
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)[['jid', 'cid', 'ctid', 'stdt']]
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 16))

df looks like this:

      jid  cid  ctid       stdt      enddt
0   100    1     4 2015-07-28 2015-08-11
1   101    2     3 2015-07-31 2015-08-14
2   102    2     3 2015-07-31 2015-08-14
3   103    1     3 2015-07-24 2015-08-07
4   104    2     4 2015-07-27 2015-08-10
5   105    1     4 2015-07-27 2015-08-10
6   106    2     4 2015-07-24 2015-08-07
7   107    2     3 2015-07-22 2015-08-05
8   108    2     3 2015-07-28 2015-08-11
9   109    1     4 2015-07-20 2015-08-03
10  110    2     3 2015-07-29 2015-08-12
11  111    1     3 2015-07-29 2015-08-12
12  112    1     3 2015-07-27 2015-08-10
13  113    1     3 2015-07-21 2015-08-04
14  114    1     4 2015-07-28 2015-08-11
15  115    2     3 2015-07-28 2015-08-11
16  116    1     3 2015-07-26 2015-08-09
17  117    1     3 2015-07-25 2015-08-08
18  118    2     3 2015-07-26 2015-08-09
19  119    2     3 2015-07-19 2015-08-02
20  120    2     3 2015-07-22 2015-08-05

What I need to do is the following: count (cnt) the number of jid that occur per ctid per cid, for each date (newdate) between the min(stdt) and max(enddt), where newdate is between the stdt and the enddt.

The resulting DataFrame should look like this (shown for just one cid with one ctid using the above data; in this case it would be replicated for cid 1/ctid 4, cid 2/ctid 3, and cid 2/ctid 4):

cid ctid    newdate cnt
1   3   7/21/2015   1
1   3   7/22/2015   1
1   3   7/23/2015   1
1   3   7/24/2015   2
1   3   7/25/2015   3
1   3   7/26/2015   4
1   3   7/27/2015   5
1   3   7/28/2015   5
1   3   7/29/2015   6
1   3   7/30/2015   6
1   3   7/31/2015   6
1   3   8/1/2015    6
1   3   8/2/2015    6
1   3   8/3/2015    6
1   3   8/4/2015    6
1   3   8/5/2015    5
1   3   8/6/2015    5
1   3   8/7/2015    5
1   3   8/8/2015    4
1   3   8/9/2015    3
1   3   8/10/2015   2
1   3   8/11/2015   1
1   3   8/12/2015   1
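For reference, this target output can be reproduced on a small frame with a brute-force expansion: explode each row into one row per active date, then count jids per (cid, ctid, date). This is only a sketch with made-up sample rows, not the answer's method, and it would be slow at 1.8M rows; note it counts each jid inclusive of its enddt, whereas the answer's cumsum drops the count on the enddt itself, so boundary handling differs slightly:

```python
import pandas as pd

# small hypothetical frame in the same shape as df above
df = pd.DataFrame({
    'jid':  [100, 101, 102],
    'cid':  [1, 1, 1],
    'ctid': [3, 3, 3],
    'stdt': pd.to_datetime(['2015-07-21', '2015-07-24', '2015-07-25']),
    'enddt': pd.to_datetime(['2015-07-23', '2015-07-26', '2015-07-27']),
})

# one row per date each jid is active, then count jids per (cid, ctid, date)
df['newdate'] = [pd.date_range(s, e) for s, e in zip(df['stdt'], df['enddt'])]
out = (df.explode('newdate')
         .groupby(['cid', 'ctid', 'newdate'])['jid']
         .size()
         .reset_index(name='cnt'))
print(out)
```

On 2015-07-25 both jid 101 and 102 are active, so that row's cnt is 2.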

This previous question (which was also mine), Count # of Rows Between Dates, was very similar and was answered using pd.melt. I am pretty sure melt can be used again, or maybe there is a better option, but I can't figure out how to get the 'two layer groupby' accomplished, which counts the size of jid for each ctid, for each cid, for each newdate. Love your inputs...

Recommended answer

After trying @Scott Boston's answer on a 1.8M-record df, the first line

df_out = pd.concat([pd.DataFrame(index=pd.date_range(df.iloc[i].stdt,df.iloc[i].enddt)).assign(**df.iloc[i,0:3]) for i in range(df.shape[0])]).reset_index()


was still running after 1 hour, and slowly eating away at memory. So I thought I'd try the following:

def reindex_by_date(df):
    # fill in any missing dates between the group's min and max date
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates)

def replace_last_0(group):
    # zero out the change on the group's last date
    group.loc[max(group.index), 'change'] = 0
    return group

def ctidloop(partdf):
    coid = partdf.cid.max()            # grab the cid for later
    cols = ['ctid', 'stdt', 'enddt']   # keep only the columns we need
    partdf = partdf[cols]
    partdf['jid'] = partdf.index       # recover jid from the index
    # one row per jid for stdt and one for enddt
    partdf = pd.melt(partdf, id_vars=['ctid', 'jid'],
                     var_name='change', value_name='newdate')
    partdf['change'] = partdf['change'].replace({'stdt': 1, 'enddt': -1})
    partdf.newdate = pd.DatetimeIndex(partdf['newdate'])
    partdf = partdf.groupby(['ctid', 'newdate'], as_index=False)['change'].sum()
    partdf = partdf.groupby('ctid').apply(replace_last_0).reset_index(drop=True)
    # running count of open jids per ctid
    partdf['cnt'] = partdf.groupby('ctid')['change'].cumsum()
    partdf.index = partdf['newdate']
    cols = ['ctid', 'change', 'cnt', 'newdate']
    partdf = partdf[cols]
    partdf = partdf.groupby('ctid').apply(reindex_by_date).reset_index(0, drop=True)
    partdf['newdate'] = partdf.index
    partdf['ctid'] = partdf['ctid'].fillna(method='ffill')
    partdf.cnt = partdf.cnt.fillna(method='ffill')
    partdf.change = partdf.change.fillna(0)
    partdf['cid'] = coid
    return partdf

gb = df.groupby('cid').apply(ctidloop)

This code returned the correct result in:

%timeit gb=df.groupby('cid').apply(ctidloop)
1 loop, best of 3: 9.74 s per loop 

EXPLANATION: Basically, melt is very quick, so I figured I'd break the first groupby up into groups and run a function on each. So this code takes the df, groups by cid, and applies the function ctidloop.

cidloop中,按行进行以下操作: 1)抓住cid供将来使用. 2,3)通过分配所需的列来建立要处理的核心partdf 4)从索引创建jid 5)运行pd.melt,通过为stdtenddt的每个jid创建一行来展平数据框. 6)创建一个'change'列,该列将+1分配给stdt,将-1分配给enddt. 7)将newdate变成datetimeindex(只是更容易进行进一步处理) 8)将ctidnewdate分组,将change求和 9)再次按ctid进行分组,将最后一个值替换为0(这只是我不需要专门针对此问题的内容) 10)按ctidcumsumming分组创建cnt 11)从newdate制作新索引 12,13)格式化列/名称 14)在ctid上的另一个groupby,但按高和低日期重新索引,从而填补了空白. 15)从新的reindex值分配newdate 16,17,18)填充各种值以填补空白(我需要此增强功能) 19)从第1行中收集的顶部变量coid再次分配cid.

In ctidloop, the following happens, line by line:
1) Grab the cid for future use.
2-3) Establish the core partdf to process by selecting the needed columns.
4) Create jid from the index.
5) Run pd.melt, which flattens the dataframe by creating one row per jid for stdt and one for enddt.
6) Create a 'change' column that assigns +1 to stdt and -1 to enddt.
7) Make newdate a DatetimeIndex (just easier for further processing).
8) Group by ctid and newdate, summing the change.
9) Group by ctid again, replacing the last value with 0 (just something I needed, not specific to the problem).
10) Create cnt by grouping by ctid and cumsumming the change.
11) Make the new index from newdate.
12-13) Format the columns/names.
14) Another groupby on ctid, but reindexing between the low and high dates, filling the gaps.
15) Assign newdate from the new reindex values.
16-18) Fill various values to fill the gaps (I needed this enhancement).
19) Assign cid again from the variable coid gathered in step 1.
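The heart of steps 5, 6, and 10 — melt start/end dates into +1/-1 events and take a cumulative sum — can be seen in isolation on a toy event frame (hypothetical data, not the answer's exact code):

```python
import pandas as pd

# toy events: two overlapping intervals, already melted into
# one row per start date and one row per end date
events = pd.DataFrame({
    'jid': [1, 1, 2, 2],
    'change': ['stdt', 'enddt', 'stdt', 'enddt'],
    'newdate': pd.to_datetime(['2015-07-01', '2015-07-03',
                               '2015-07-02', '2015-07-05']),
})

# +1 when an interval opens, -1 when it closes
events['change'] = events['change'].replace({'stdt': 1, 'enddt': -1})
daily = events.groupby('newdate')['change'].sum()

# running total = number of open intervals at each event date
cnt = daily.cumsum()
print(cnt)
```

On 2015-07-02 both intervals are open (cnt 2); the -1 on each enddt brings the count back down, which is why the answer's replace_last_0 step zeroes the final change so the last date still shows a count.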

Do this for each cid through the last line of code gb=df.groupby.....

Thanks @Scott Boston for the attempt. It certainly works, but it took too long for me.
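For completeness: the slow part of that answer is building one small DataFrame per row and concatenating 1.8M of them. A common vectorized alternative is to repeat each row by its span length and add a day offset. This is only a sketch on a tiny hypothetical frame, not code from either answer:

```python
import numpy as np
import pandas as pd

# tiny hypothetical frame with the same columns as df
df = pd.DataFrame({
    'jid': [100, 101],
    'cid': [1, 1],
    'ctid': [3, 4],
    'stdt': pd.to_datetime(['2015-07-21', '2015-07-24']),
    'enddt': pd.to_datetime(['2015-07-23', '2015-07-25']),
})

# number of days each interval spans, inclusive of both ends
n = (df['enddt'] - df['stdt']).dt.days.to_numpy() + 1

# repeat each row n times, then add a 0..n-1 day offset to stdt
out = df.loc[df.index.repeat(n)].copy()
out['newdate'] = out['stdt'] + pd.to_timedelta(
    np.concatenate([np.arange(k) for k in n]), unit='D')
print(out[['jid', 'cid', 'ctid', 'newdate']])
```

From here the per-(cid, ctid, newdate) counts are a single groupby size, with no Python-level loop over rows.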

Kudos to @DSM for his solution HERE which was the basis of my solution.
