How to create a group ID based on 5-minute intervals in a pandas time series?


Problem description

I have a time series DataFrame df that looks like this (the timestamps all fall within the same day, but across different hours):

                                id               val 
 time                    
2014-04-03 16:01:53             23              14389      
2014-04-03 16:01:54             28              14391             
2014-04-03 16:05:55             24              14393             
2014-04-03 16:06:25             23              14395             
2014-04-03 16:07:01             23              14395             
2014-04-03 16:10:09             23              14395             
2014-04-03 16:10:23             26              14397             
2014-04-03 16:10:57             26              14397             
2014-04-03 16:11:10             26              14397              

I need to create a group every 5 minutes, starting from 16:00:00. That is, every row in the range 16:00:00 to 16:05:00 gets the value 1 in a new column period, and so on. (The number of rows in each group is irregular, so I can't simply cut the groups by row count.)

Eventually, the data should look like this:

                                id               val           period 
time            
2014-04-03 16:01:53             23              14389             1
2014-04-03 16:01:54             28              14391             1
2014-04-03 16:05:55             24              14393             2
2014-04-03 16:06:25             23              14395             2
2014-04-03 16:07:01             23              14395             2
2014-04-03 16:10:09             23              14395             3
2014-04-03 16:10:23             26              14397             3
2014-04-03 16:10:57             26              14397             3
2014-04-03 16:11:10             26              14397             3

The purpose is to perform a groupby operation, but the operation I need is not covered by the pd.resample(how=' ') method. So I have to create a period column to identify each group, then do df.groupby('period').apply(myfunc).

Any help or comments are highly appreciated.

Thanks!

Recommended answer

You can use the TimeGrouper function in a groupby/apply. With a TimeGrouper you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].mean()

time
2014-04-03 16:00:00    14390.000000
2014-04-03 16:05:00    14394.333333
2014-04-03 16:10:00    14396.500000

Or an example with an explicit apply:

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].apply(lambda x: len(x) > 3)

time
2014-04-03 16:00:00    False
2014-04-03 16:05:00    False
2014-04-03 16:10:00     True
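
pd.TimeGrouper was deprecated and later removed from pandas; pd.Grouper(freq=...) is its replacement. As a rough sketch, the two calls above could be written as follows on a current pandas release. The sample frame is rebuilt by hand from the question, and the lowercase '5min' alias is my choice rather than part of the original answer:

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about
# how the original frame was constructed).
times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

# Bin the DatetimeIndex into 5-minute intervals; no period column needed.
grouped = df.groupby(pd.Grouper(freq="5min"))["val"]

means = grouped.mean()                        # mean of val per interval
flags = grouped.apply(lambda x: len(x) > 3)   # any custom function per interval

print(means)
print(flags)
```

The bins are anchored at midnight by default, so they land on 16:00, 16:05 and 16:10 for this data.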

Docstring for TimeGrouper:

Docstring for resample:class TimeGrouper@21

TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)

Custom groupby class for time-interval grouping

Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
    If axis is PeriodIndex

Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object

Edit

I don't know of an elegant way to create the period column, but the following will work:

>>> new = df.groupby(pd.TimeGrouper('5Min'),as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df

                     id    val  period
time
2014-04-03 16:01:53  23  14389       0
2014-04-03 16:01:54  28  14391       0 
2014-04-03 16:05:55  24  14393       1
2014-04-03 16:06:25  23  14395       1
2014-04-03 16:07:01  23  14395       1
2014-04-03 16:10:09  23  14395       2
2014-04-03 16:10:23  26  14397       2
2014-04-03 16:10:57  26  14397       2
2014-04-03 16:11:10  26  14397       2

It works because the groupby here with as_index=False actually returns the period column you want as part of the MultiIndex, and I just grab that level of the MultiIndex and assign it to a new column in the original DataFrame. You could do anything in the apply; I just want the index:

>>> new

   time
0  2014-04-03 16:01:53    14389
   2014-04-03 16:01:54    14391
1  2014-04-03 16:05:55    14393
   2014-04-03 16:06:25    14395
   2014-04-03 16:07:01    14395
2  2014-04-03 16:10:09    14395
   2014-04-03 16:10:23    14397
   2014-04-03 16:10:57    14397
   2014-04-03 16:11:10    14397

>>> new.index.get_level_values(0)

Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
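
If the 1-based numbering from the question is wanted, one alternative (not the approach above, just a sketch) is to compute the period arithmetically from the elapsed time since a chosen start. The start value of 16:00:00 is taken from the question, and the sample frame is again rebuilt by hand. Note that this numbers intervals by elapsed time, so an interval with no rows still consumes a number:

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption).
times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

# Start of the first interval, as stated in the question.
start = pd.Timestamp("2014-04-03 16:00:00")

# Whole 5-minute steps elapsed since start, shifted so the first period is 1.
df["period"] = (df.index - start) // pd.Timedelta(minutes=5) + 1

print(df)
```

With the column in place, df.groupby('period').apply(myfunc) from the question can be used as is.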
