创建一个Pandas数据框,其中包含跨越日期范围的项目数 [英] Create a Pandas dataframe with counts of items spanning a date range

查看:100
本文介绍了创建一个Pandas数据框,其中包含跨越日期范围的项目数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个有两个感兴趣日期的DF,看起来像:

I have a DF that has two dates of interest that looks kind of like:

LIST_DATE     END_DATE
2000-04-18    2000-05-17 00:00:00
2000-05-18    2000-09-18 00:00:00
2000-04-18    2001-06-07 00:00:00

然后我按月创建了一个期间索引表"montot",该表目前仅具有月份和年份索引

And I created a Period index table "montot" by month that currently only has the month and year index

<class 'pandas.tseries.period.PeriodIndex'>
freq: M
[1999-01, ..., 2013-07]

我想要做的是对第二个表中的每个月"montot"计数在该时间段内的第一个表中的项目(可能是按月列出的活动列表),然后将该字段添加到表中. ..因此,例如,第一个表中的第一个项目将在第4个月中计数为1,第5个月为一次,而第二个项目将在第5个月至第9个月中计数为1,等等.表/字段. 所以我有一张桌子

What I want to do is to for each month in the second table "montot" count the items in the 1st table which fall within the time periods (happens to be active listings by month) and add that field to the table... so for instance the 1st item in the 1st table would be counted 1 in month 4 and once in month 5 while the second item woul be counted once in month 5 through month 9 etc..with the monthly total being recorded in the new table/field. So I'll have a table

Month    active
1/1999     5
2/1999     8

等.还不知道如何使用Pandas/Python来实现它...

etc.. Have not got a clue how to approach it with Pandas/Python...

推荐答案

这里是一种方法,首先

Here's one way to do it, first value_counts the periods in each of the date columns (using the to_period Timestamp method):

In [11]: p = pd.PeriodIndex(freq='m', start='2000-1', periods=18)

In [12]: starts = df['LIST_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()

In [13]: ends = df['END_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()

通过PeriodIndex将它们重新索引,填写NaN(以便您可以减去),并从累积结束数中选择累积开始数,以为您提供当前活动状态:

Reindex these by the PeriodIndex, fill in the NaNs (so you can subtract) and take the cumulative started from the cumulative ended, to give you the currently active:

In [14]: starts.reindex(p).fillna(0).cumsum() - ends.reindex(p).fillna(0).cumsum()
Out[14]: 
2000-01    0
2000-02    0
2000-03    0
2000-04    2
2000-05    2
2000-06    2
2000-07    2
2000-08    2
2000-09    1
2000-10    1
2000-11    1
2000-12    1
2001-01    1
2001-02    1
2001-03    1
2001-04    1
2001-05    1
2001-06    0
Freq: M, dtype: float64

最后一个替代步骤是创建一个DataFrame(最初跟踪更改,因此开始为正,结束为负):

An alternative final step is to create a DataFrame (which initially tracks changes, hence starts is positive and ends negative):

In [21]: current = pd.DataFrame({'starts': starts, 'ends': -ends}, p)

In [22]: current
Out[22]:
         ends  starts
2000-01   NaN     NaN
2000-02   NaN     NaN
2000-03   NaN     NaN
2000-04   NaN       2
2000-05    -1       1
2000-06   NaN     NaN
2000-07   NaN     NaN
2000-08   NaN     NaN
2000-09    -1     NaN
2000-10   NaN     NaN
2000-11   NaN     NaN
2000-12   NaN     NaN
2001-01   NaN     NaN
2001-02   NaN     NaN
2001-03   NaN     NaN
2001-04   NaN     NaN
2001-05   NaN     NaN
2001-06    -1     NaN

In [23]: current.fillna(0)
Out[23]:
         ends  starts
2000-01     0       0
2000-02     0       0
2000-03     0       0
2000-04     0       2
2000-05    -1       1
2000-06     0       0
2000-07     0       0
2000-08     0       0
2000-09    -1       0
2000-10     0       0
2000-11     0       0
2000-12     0       0
2001-01     0       0
2001-02     0       0
2001-03     0       0
2001-04     0       0
2001-05     0       0
2001-06    -1       0 

cumsum跟踪开始和结束到该点的运行总计:

The cumsum track the running totals of starts and ends up to that point:

In [24]: current.fillna(0).cumsum()
Out[24]:
         ends  starts
2000-01     0       0
2000-02     0       0
2000-03     0       0
2000-04     0       2
2000-05    -1       3
2000-06    -1       3
2000-07    -1       3
2000-08    -1       3
2000-09    -2       3
2000-10    -2       3
2000-11    -2       3
2000-12    -2       3
2001-01    -2       3
2001-02    -2       3
2001-03    -2       3
2001-04    -2       3
2001-05    -2       3
2001-06    -3       3 

将这些列加在一起,得出当前处于活动状态的列,结果与上面相同:

And summing these columns together, gives those currently active, and is same result as above:

In [25]: current.fillna(0).cumsum().sum(1)

这篇关于创建一个Pandas数据框,其中包含跨越日期范围的项目数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆