Count Number of Rows Between Two Dates BY ID in a Pandas GroupBy Dataframe


Problem description


I have the following test DataFrame:

import random
from datetime import timedelta
import pandas as pd
import datetime

#create test range of dates
rng=pd.date_range(datetime.date(2015,1,1),datetime.date(2015,7,31))
rnglist=rng.tolist()
testpts = range(100,121)
#create test dataframe (note: random.randint is inclusive at both ends, so index with len(rng)-1)
d={'jid':[i for i in range(100,121)], 'cid':[random.randint(1,2) for _ in testpts],
    'stdt':[rnglist[random.randint(0,len(rng)-1)] for _ in testpts]}
df=pd.DataFrame(d)
df['enddt'] = df['stdt']+timedelta(days=random.randint(2,32))

Which gives a dataframe like the one below, with a company id column 'cid', a unique id column 'jid', a start date 'stdt', and an end date 'enddt'.

   cid  jid       stdt      enddt
0    1  100 2015-07-06 2015-07-13
1    1  101 2015-07-15 2015-07-22
2    2  102 2015-07-12 2015-07-19
3    2  103 2015-07-07 2015-07-14
4    2  104 2015-07-14 2015-07-21
5    1  105 2015-07-11 2015-07-18
6    1  106 2015-07-12 2015-07-19
7    2  107 2015-07-01 2015-07-08
8    2  108 2015-07-10 2015-07-17
9    2  109 2015-07-09 2015-07-16

What I need to do is the following: count, by cid, the number of jid for each date (newdate) between the min(stdt) and max(enddt), where the newdate falls between that jid's stdt and enddt.

The resulting data set should be a dataframe that has, for each cid, a column of dates (newdate) running from that cid's min(stdt) to its max(enddt), and a count (cnt) of the number of jid for which the newdate falls between their stdt and enddt. The resulting DataFrame should look like this (shown for just one cid, using the data above):

cid newdate cnt
1   2015-07-06  1
1   2015-07-07  1
1   2015-07-08  1
1   2015-07-09  1
1   2015-07-10  1
1   2015-07-11  2
1   2015-07-12  3
1   2015-07-13  3
1   2015-07-14  2
1   2015-07-15  3
1   2015-07-16  3
1   2015-07-17  3
1   2015-07-18  3
1   2015-07-19  2
1   2015-07-20  1
1   2015-07-21  1
1   2015-07-22  1
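
To make the counting rule concrete, here is a quick sanity check (a minimal sketch; newdate is just an illustrative variable, and the expected value of 3 assumes only the four cid-1 rows printed above, jids 100, 101, 105 and 106):

# a jid is counted on newdate if stdt <= newdate <= enddt (both ends inclusive)
newdate = pd.Timestamp('2015-07-12')
mask = (df['cid'] == 1) & (df['stdt'] <= newdate) & (df['enddt'] >= newdate)
print(mask.sum())  # 3 -- jids 100, 105 and 106, matching cnt = 3 above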

I believe there should be a way to use pandas groupby (groupby cid), and some form of lambda(?) to pythonically create this new dataframe.

I currently run a loop over each cid: I slice that cid's rows out of the master df, determine the relevant date range (min stdt to max enddt for that cid frame), and then, for each newdate in that range, count the number of jid whose stdt and enddt bracket the newdate. I then append each resulting dataset to a new dataframe that looks like the above.
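
Roughly, the loop looks like this (a simplified sketch; variable names are illustrative):

parts = []
for cid, cframe in df.groupby('cid'):
    # every date between this cid's earliest start and latest end
    dates = pd.date_range(cframe['stdt'].min(), cframe['enddt'].max())
    # for each date, count the jids whose [stdt, enddt] interval covers it
    cnts = [((cframe['stdt'] <= d) & (cframe['enddt'] >= d)).sum() for d in dates]
    parts.append(pd.DataFrame({'cid': cid, 'newdate': dates, 'cnt': cnts}))
result = pd.concat(parts, ignore_index=True)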

But this is very expensive from a resource and time perspective. Doing this on millions of jid for thousands of cid literally takes a full day. I am hoping there is a simple(r) pandas solution here.

Solution

My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (Adds -1 the next day, at least if I'm interpreting "between" the way you are. Some days I think we should ban the use of the word as too ambiguous..)

IOW, if we turn your frame to something like

>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01

then what we want is simply the cumulative sum of change (after suitable regrouping.) For example, something like

df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])

df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()

new_time = pd.date_range(df.date.min(), df.date.max())

df_parts = []
for cid, group in df.groupby("cid"):
    full_count = group[["date", "count"]].set_index("date")
    full_count = full_count.reindex(new_time)
    full_count = full_count.ffill().fillna(0)
    full_count["cid"] = cid
    df_parts.append(full_count)

df_new = pd.concat(df_parts)

which gives me something like

>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1

There may be off-by-one differences with regards to your expectations; you may have different ideas about how you should handle multiple overlapping jids in the same time window (here they would count as 2); but the basic idea of working with the events should prove useful even if you have to tweak the details.
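
If you want the result in the exact (cid, newdate, cnt) shape from the question, a small reshaping step on top of df_new should do it (a sketch, assuming the column names above; dropping the zero-count padding days is optional):

out = df_new.reset_index().rename(columns={"index": "newdate", "count": "cnt"})
out["cnt"] = out["cnt"].astype(int)     # counts come back as floats after fillna
out = out[["cid", "newdate", "cnt"]]
out = out[out["cnt"] > 0]               # optional: drop days outside each cid's activity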
