Count Number of Rows GroupBy within a GroupBy Between Two Dates in Pandas Dataframe
Question
I have a dataframe df, which can be created with the following code:
import random
from datetime import timedelta
import pandas as pd
import datetime
#create test range of dates
rng=pd.date_range(datetime.date(2015,7,15),datetime.date(2015,7,31))
rnglist=rng.tolist()
testpts = range(100,121)
#create test dataframe
d={'jid':[i for i in range(100,121)],
'cid':[random.randint(1,2) for _ in testpts],
'ctid':[random.randint(3,4) for _ in testpts],
'stdt':[rnglist[random.randint(0,len(rng)-1)] for _ in testpts]}
df=pd.DataFrame(d)[['jid','cid','ctid','stdt']]
df['enddt'] = df['stdt']+timedelta(days=random.randint(2,16))
df looks like this:
jid cid ctid stdt enddt
0 100 1 4 2015-07-28 2015-08-11
1 101 2 3 2015-07-31 2015-08-14
2 102 2 3 2015-07-31 2015-08-14
3 103 1 3 2015-07-24 2015-08-07
4 104 2 4 2015-07-27 2015-08-10
5 105 1 4 2015-07-27 2015-08-10
6 106 2 4 2015-07-24 2015-08-07
7 107 2 3 2015-07-22 2015-08-05
8 108 2 3 2015-07-28 2015-08-11
9 109 1 4 2015-07-20 2015-08-03
10 110 2 3 2015-07-29 2015-08-12
11 111 1 3 2015-07-29 2015-08-12
12 112 1 3 2015-07-27 2015-08-10
13 113 1 3 2015-07-21 2015-08-04
14 114 1 4 2015-07-28 2015-08-11
15 115 2 3 2015-07-28 2015-08-11
16 116 1 3 2015-07-26 2015-08-09
17 117 1 3 2015-07-25 2015-08-08
18 118 2 3 2015-07-26 2015-08-09
19 119 2 3 2015-07-19 2015-08-02
20 120 2 3 2015-07-22 2015-08-05
What I need to do is the following: count (cnt) the number of jid that occur by ctid by cid, for each date (newdate) between the min(stdt) and max(enddt), where the newdate is between the stdt and the enddt.
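For reference, a brute-force way to express this requirement (my own sketch, not the accepted approach) is to expand each row into one row per active date and count jids per cid/ctid/newdate; it is simple and correct but memory-hungry at millions of rows:

```python
import pandas as pd

# Hedged sketch: expand each (stdt, enddt) interval into one row per day,
# then count jids per cid/ctid/newdate. The helper name count_active is
# illustrative, not from the question.
def count_active(df):
    expanded = df.assign(
        newdate=[pd.date_range(s, e) for s, e in zip(df['stdt'], df['enddt'])]
    ).explode('newdate')
    return (expanded.groupby(['cid', 'ctid', 'newdate'])
                    .size()
                    .reset_index(name='cnt'))
```

This trades memory for clarity; the accepted answer below avoids materializing one row per day per jid.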
The resulting DataFrame should look like the following (this is just for 1 cid with 1 ctid using the above data; in this case it would replicate for cid 1/ctid 4, cid 2/ctid 3, cid 2/ctid 4):
cid ctid newdate cnt
1 3 7/21/2015 1
1 3 7/22/2015 1
1 3 7/23/2015 1
1 3 7/24/2015 2
1 3 7/25/2015 3
1 3 7/26/2015 4
1 3 7/27/2015 5
1 3 7/28/2015 5
1 3 7/29/2015 6
1 3 7/30/2015 6
1 3 7/31/2015 6
1 3 8/1/2015 6
1 3 8/2/2015 6
1 3 8/3/2015 6
1 3 8/4/2015 6
1 3 8/5/2015 5
1 3 8/6/2015 5
1 3 8/7/2015 5
1 3 8/8/2015 4
1 3 8/9/2015 3
1 3 8/10/2015 2
1 3 8/11/2015 1
1 3 8/12/2015 1
This previous question (which was also mine), Count # of Rows Between Dates, was very similar and was answered using pd.melt. I am pretty sure melt can be used again, or maybe there is a better option, but I can't figure out how to get the 'two layer groupby' accomplished which counts the size of jid for each ctid, for each cid, for each newdate. Love your inputs...
Answer
After trying @Scott Boston answer, for a 1.8m record df, the first line
df_out = pd.concat([pd.DataFrame(index=pd.date_range(df.iloc[i].stdt,df.iloc[i].enddt)).assign(**df.iloc[i,0:3]) for i in range(df.shape[0])]).reset_index()
was still running after 1 hour, and slowly eating away at memory. So I thought I'd try the following:
def reindex_by_date(df):
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates)
def replace_last_0(group):
group.loc[max(group.index),'change']=0
return group
def ctidloop(partdf):
coid=partdf.cid.max()
cols=['ctid', 'stdt', 'enddt']
partdf=partdf[cols]
partdf['jid']=partdf.index
partdf = pd.melt(partdf, id_vars=['ctid', 'jid'],var_name='change', value_name='newdate')
partdf['change'] = partdf['change'].replace({'stdt': 1, 'enddt': -1})
partdf.newdate=pd.DatetimeIndex(partdf['newdate'])
partdf=partdf.groupby(['ctid', 'newdate'],as_index=False)['change'].sum()
partdf=partdf.groupby('ctid').apply(replace_last_0).reset_index(drop=True)
partdf['cnt'] = partdf.groupby('ctid')['change'].cumsum()
partdf.index=partdf['newdate']
cols=['ctid', 'change', 'cnt', 'newdate']
partdf=partdf[cols]
partdf=partdf.groupby('ctid').apply(reindex_by_date).reset_index(0, drop=True)
partdf['newdate']=partdf.index
partdf['ctid']=partdf['ctid'].fillna(method='ffill')
partdf.cnt=partdf.cnt.fillna(method='ffill')
partdf.change=partdf.change.fillna(0)
partdf['cid']=coid
return partdf
gb=df.groupby('cid').apply(ctidloop)
This code returned the correct result in:
%timeit gb=df.groupby('cid').apply(ctidloop)
1 loop, best of 3: 9.74 s per loop
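The heart of this approach, stripped of the cid/ctid grouping, is the classic interval-counting trick: melt each (stdt, enddt) pair into +1/-1 events and take a cumulative sum over sorted dates. A minimal sketch:

```python
import pandas as pd

# Minimal illustration of the melt + cumsum trick: each interval contributes a
# +1 event at stdt and a -1 event at enddt; the running total over sorted event
# dates is the active count (toy data, two overlapping intervals).
df = pd.DataFrame({'jid': [1, 2],
                   'stdt': pd.to_datetime(['2015-07-01', '2015-07-02']),
                   'enddt': pd.to_datetime(['2015-07-03', '2015-07-04'])})
events = pd.melt(df, id_vars='jid', var_name='change', value_name='newdate')
events['change'] = events['change'].map({'stdt': 1, 'enddt': -1})
counts = events.groupby('newdate')['change'].sum().cumsum()
# counts: 07-01 -> 1, 07-02 -> 2, 07-03 -> 1, 07-04 -> 0
```

Because melt only doubles the row count (instead of creating one row per day per jid), this scales far better than expanding every interval into daily rows.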
EXPLANATION: Basically, melt is very quick. So I figured I would just break the first groupby up into groups and run a function on each. This code takes the df, groups by the cid, and applies the function ctidloop.
In ctidloop, the following happens line by line:
1) Grab the cid for future use.
2,3) Establish the core partdf to process by assigning the needed columns.
4) Create jid from the index.
5) Run pd.melt, which flattens the dataframe by creating a row for each jid for stdt and enddt.
6) Create a 'change' column which assigns +1 to stdt and -1 to enddt.
7) Make newdate a DatetimeIndex (just easier for further processing).
8) Group what we have by ctid and newdate, summing the change.
9) Group by ctid again, replacing the last value with 0 (this is just something I needed, not specific to the problem).
10) Create cnt by grouping by ctid and cumsumming the change.
11) Make the new index from the newdate.
12,13) Format columns/names.
14) Another groupby on ctid, but reindexing by the high and low dates, filling the gaps.
15) Assign newdate from the new reindex values.
16,17,18) Fill various values to fill the gaps (I needed this enhancement).
19) Assign cid again from the top variable coid gathered in line 1.
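Step 14's gap filling can be seen in isolation. A toy sketch of the reindex-then-forward-fill idea (variable names here are illustrative): a date-indexed frame missing a day is reindexed onto the full daily range and its count is carried forward:

```python
import pandas as pd

# A date-indexed frame with 07-02 missing gets the full daily range via
# reindex, and the NaN that appears is forward-filled from the prior day.
part = pd.DataFrame({'cnt': [1, 3]},
                    index=pd.to_datetime(['2015-07-01', '2015-07-03']))
full = part.reindex(pd.date_range(part.index.min(), part.index.max()))
full['cnt'] = full['cnt'].ffill()
# full now covers 07-01..07-03, with 07-02's cnt carried forward as 1.0
```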
Do this for each cid through the last line of code, gb=df.groupby.....
Thanks @Scott Boston for the attempt. It certainly works, but it took too long for me.
Kudos to @DSM for his solution HERE which was the basis of my solution.