Pandas group by time windows


Question

Session generation from log file analysis with pandas seems to be exactly what I want.

I have a dataframe that includes non-unique timestamps, and I'd like to group them by time windows. The basic logic would be:

1) Create a time range from each timestamp by adding n minutes before and after the timestamp.

2) Group by time ranges that overlap. The end effect is that a time window can be as small as a single timestamp +/- the time buffer, but there is no cap on how large a time window can grow, as long as consecutive events are closer together than the time buffer.

It feels like df.groupby(pd.TimeGrouper(minutes=n)) is the right answer, but I don't know how to have the TimeGrouper create dynamic time ranges when it sees events that are within a time buffer.

For instance, if I try a TimeGrouper('20s') against a set of events: 10:34:00, 10:34:08, 10:34:08, 10:34:15, 10:34:28 and 10:34:54, then pandas will give me three groups (events falling between 10:34:00 - 10:34:20, 10:34:20 - 10:34:40, and 10:34:40 - 10:35:00). I would like to get just two groups back: 10:34:00 - 10:34:28, since there is no more than a 20 second gap between events in that time range, and a second group that is 10:34:54.
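The fixed-bin behavior described above can be reproduced with a short sketch. It assumes a current pandas, where pd.Grouper(freq=...) stands in for pd.TimeGrouper (which was deprecated and later removed). The variable names and the date 2013-01-01 are illustrative choices, not part of the original question.

```python
import pandas as pd

# The six events from the example (the date is an assumption for illustration).
events = pd.to_datetime([
    "2013-01-01 10:34:00",
    "2013-01-01 10:34:08",
    "2013-01-01 10:34:08",
    "2013-01-01 10:34:15",
    "2013-01-01 10:34:28",
    "2013-01-01 10:34:54",
])
s = pd.Series(range(len(events)), index=events)

# Static 20-second bins: bin edges are fixed, regardless of
# how close together the events actually are.
fixed = s.groupby(pd.Grouper(freq="20s")).size()
print(fixed)
```

The event at 10:34:28 lands in its own bin even though it is only 13 seconds after the previous event, which is the limitation the question is about.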

What is the best way to find temporal windows that are not static bins of time ranges?

Given a Series that looks something like this:

      time
0     2013-01-01 10:34:00+00:00
1     2013-01-01 10:34:12+00:00
2     2013-01-01 10:34:28+00:00
3     2013-01-01 10:34:54+00:00
4     2013-01-01 10:34:55+00:00
5     2013-01-01 10:35:19+00:00
6     2013-01-01 10:35:30+00:00

If I do a df.groupby(pd.TimeGrouper('20s')) on that Series, I get back 5 groups: 10:34:00-:20, :20-:40, :40-10:35:00, etc. What I want is some function that creates elastic time ranges: as long as events are within 20 seconds of each other, expand the time range. So I expect to get back:

2013-01-01 10:34:00 - 2013-01-01 10:34:48 
    0 2013-01-01 10:34:00+00:00
    1 2013-01-01 10:34:12+00:00
    2 2013-01-01 10:34:28+00:00

2013-01-01 10:34:54 - 2013-01-01 10:35:15
    3 2013-01-01 10:34:54+00:00
    4 2013-01-01 10:34:55+00:00

2013-01-01 10:35:19 - 2013-01-01 10:35:50
    5 2013-01-01 10:35:19+00:00
    6 2013-01-01 10:35:30+00:00

Thanks.

Recommended answer

This is how to create a custom grouper. It requires pandas >= 0.13 for the timedelta computations, but would otherwise work in other versions.

Create your series:

In [31]: s = pd.Series(range(8), pd.to_datetime(['20130101 10:34','20130101 10:34:08', '20130101 10:34:08', '20130101 10:34:15', '20130101 10:34:28', '20130101 10:34:54','20130101 10:34:55','20130101 10:35:12']))

In [32]: s
Out[32]: 
2013-01-01 10:34:00    0
2013-01-01 10:34:08    1
2013-01-01 10:34:08    2
2013-01-01 10:34:15    3
2013-01-01 10:34:28    4
2013-01-01 10:34:54    5
2013-01-01 10:34:55    6
2013-01-01 10:35:12    7
dtype: int64

This just computes the time difference in seconds between successive elements, but it could actually be anything:

In [33]: indexer = s.index.to_series().sort_values().diff().dt.total_seconds().fillna(0)

In [34]: indexer
Out[34]: 
2013-01-01 10:34:00     0.0
2013-01-01 10:34:08     8.0
2013-01-01 10:34:08     0.0
2013-01-01 10:34:15     7.0
2013-01-01 10:34:28    13.0
2013-01-01 10:34:54    26.0
2013-01-01 10:34:55     1.0
2013-01-01 10:35:12    17.0
dtype: float64

Arbitrarily assign things < 20s to group 0, else to group 1. This could also be more arbitrary: if the diff from the previous element is < 20 BUT the total diff (from the first element) is > 50, put it in group 2.

In [35]: grouper = indexer.copy()

In [36]: grouper[indexer<20] = 0

In [37]: grouper[indexer>=20] = 1

In [95]: grouper[(indexer<20) & (indexer.cumsum()>50)] = 2

In [96]: grouper
Out[96]: 
2013-01-01 10:34:00    0.0
2013-01-01 10:34:08    0.0
2013-01-01 10:34:08    0.0
2013-01-01 10:34:15    0.0
2013-01-01 10:34:28    0.0
2013-01-01 10:34:54    1.0
2013-01-01 10:34:55    2.0
2013-01-01 10:35:12    2.0
dtype: float64

Group them (you can also use an apply here):

In [97]: s.groupby(grouper).sum()
Out[97]: 
0.0    10
1.0     5
2.0    13
dtype: int64
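For the elastic windows the question asks about, one common variation on the same idea is to start a new group whenever the gap to the previous event exceeds the buffer, and take a cumulative sum of those breaks as the group label. The following is a minimal sketch of that approach, not the answer author's code. It assumes a current pandas, and the names gap, session and windows are hypothetical.

```python
import pandas as pd

# The sample timestamps from the question (UTC, as shown there).
times = pd.to_datetime([
    "2013-01-01 10:34:00",
    "2013-01-01 10:34:12",
    "2013-01-01 10:34:28",
    "2013-01-01 10:34:54",
    "2013-01-01 10:34:55",
    "2013-01-01 10:35:19",
    "2013-01-01 10:35:30",
], utc=True)
df = pd.DataFrame({"time": times})

# Assumed buffer: a gap of more than 20 seconds starts a new window.
gap = pd.Timedelta(seconds=20)

# Sort first so diff() compares each event with its predecessor.
df = df.sort_values("time").reset_index(drop=True)

# True where a new window starts; the first row's diff is NaT, which
# compares as False, so the first event falls into window 0.
session = (df["time"].diff() > gap).cumsum().rename("session")

# First event, last event, and event count per window.
windows = df.groupby(session)["time"].agg(["min", "max", "count"])
print(windows)
```

On the question's data this puts rows 0-2, 3-4 and 5-6 into three separate windows, matching the grouping the question expects. Adding the buffer to the max column would give the window end times listed there.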
