Aggregate events with start and end times with Pandas


Problem description

I have data for a number of events with start and end times like this:

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])

Out:

         end      start  value
0 2015-01-07 2015-01-05      3
1 2015-01-15 2015-01-10      4
2 2015-01-13 2015-01-11      5

Now I need to calculate the number of events active at the same time, and e.g. the sum of their values. So the result should look something like this:

      date  count   sum
2015-01-05      1     3
2015-01-06      1     3
2015-01-07      1     3
2015-01-08      0     0
2015-01-09      0     0
2015-01-10      1     4
2015-01-11      2     9
2015-01-12      2     9
2015-01-13      2     9
2015-01-14      1     4
2015-01-15      1     4

Any ideas for how to do this? I was thinking about using a custom Grouper for groupby, but as far as I can see a Grouper can only assign a row to a single group, so that doesn't look useful.

After some testing I found this rather ugly way to get the desired result:

df['count'] = 1
dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')

# Align events on the full date range, keyed by start date and by end date
# (the end frame is shifted one day so an event still counts on its end date)
start = df[['start', 'value', 'count']].set_index('start').reindex(dates)
end = df[['end', 'value', 'count']].set_index('end').reindex(dates).shift(1)

# Running totals of everything started / ended so far
# (pd.rolling_sum in older pandas; .rolling(...).sum() on current versions)
rstart = start.rolling(len(start), min_periods=1).sum()
rend = end.rolling(len(end), min_periods=1).sum()

# Active = started so far minus ended so far
rstart.subtract(rend, fill_value=0).fillna(0)

However, this only works with sums, and I can't see an obvious way to make it work with other functions. For example, is there a way to get it to work with median instead of sum?

Recommended answer

If I were using SQL, I would do this by joining an all-dates table to the events table, and then grouping by date. Pandas doesn't make this approach especially easy, since there's no way to left-join on a condition, but we can fake it using dummy columns and reindexing:

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df['dummy'] = 1  # constant key so the merge below acts as a cross join

Then:

date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))

# Cross join every date with every event, then keep only the rows
# where the event is active on that date
cross_join = date_df.merge(df, on='dummy')
cond_join = cross_join[(cross_join.start <= cross_join.date) & (cross_join.date <= cross_join.end)]
grp_join = cond_join.groupby(['date'])
final = (
    pd.DataFrame(dict(
        val_count=grp_join.size(),
        val_sum=grp_join.value.sum(),
        val_median=grp_join.value.median()
    ), index=date_series)
    .fillna(0)
    .reset_index()
)
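
For the sample data this yields one row per date in date_series; on 2015-01-11 through 2015-01-13, for instance, the two overlapping events (values 4 and 5) are both active, so val_count is 2, val_sum is 9 and val_median is 4.5.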

The fillna(0) isn't perfect, since it makes nulls in the val_median column into 0s, when they should really remain nulls.
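
If those zeros matter, one option (my own sketch, not part of the original answer) is to fill only the columns where zero is meaningful and leave val_median alone:

final = (
    pd.DataFrame(dict(
        val_count=grp_join.size(),
        val_sum=grp_join.value.sum(),
        val_median=grp_join.value.median()
    ), index=date_series)
    .reset_index()
)
# Zero-fill only the count and sum; val_median stays null where no event is active
final[['val_count', 'val_sum']] = final[['val_count', 'val_sum']].fillna(0)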

Alternatively, with pandas-ply we can code that up as:

from pandas_ply import install_ply, X
install_ply(pd)  # enables .ply_where/.ply_select and the symbolic expression X

date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))

final = (
    date_df
    .merge(df, on='dummy')
    .ply_where(X.start <= X.date, X.date <= X.end)
    .groupby('date')
    .ply_select(val_count=X.size(), val_sum=X.value.sum(), median=X.value.median())
    .reindex(date_series)
    .ply_select('*', val_count=X.val_count.fillna(0), val_sum=X.val_sum.fillna(0))
    .reset_index()
)

which handles nulls a bit better.
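
Since pandas-ply hasn't tracked recent pandas releases, here is a rough equivalent in plain modern pandas (my own sketch, assuming pandas >= 0.25 for DataFrame.explode, with column names of my choosing): expand each event into the dates it spans and group by date, which works with any aggregation, median included.

# One row per (event, active date)
events = df.assign(
    date=[list(pd.date_range(s, e, freq='1D')) for s, e in zip(df['start'], df['end'])]
).explode('date')
events['date'] = pd.to_datetime(events['date'])

per_day = events.groupby('date')['value'].agg(['count', 'sum', 'median'])
per_day = per_day.reindex(date_series)                           # restore gap dates as NaN
per_day[['count', 'sum']] = per_day[['count', 'sum']].fillna(0)  # median stays null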
