Count of a value in consecutive timestamps in pandas


Question

Hour              Site
01/08/2020 00:00    A
01/08/2020 00:00    B
01/08/2020 00:00    C
01/08/2020 00:00    D
01/08/2020 01:00    A
01/08/2020 01:00    B
01/08/2020 01:00    E
01/08/2020 01:00    F
01/08/2020 02:00    A
01/08/2020 02:00    E
01/08/2020 03:00    C
01/08/2020 03:00    G
 …..    
01/08/2020 04:00    x
01/08/2020 04:00    s

 …..    

01/08/2020 23:00    G
02/08/2020 00:00    G

I have a dataframe like the one above. For each site, I want to count how many consecutive hours it appears in, together with the start and end timestamps of that run. Each hour contains multiple sites. For example, site A appears in 3 consecutive timestamps, then again in a single timestamp. I want output like the one below, or a more efficient format.

Hour              Site count    period_start    Period_end
01/08/2020 00:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 00:00    B   2   01/08/2020 00:00    01/08/2020 01:00
01/08/2020 00:00    C   1   ….. …
01/08/2020 00:00    D   1   ….  ….
01/08/2020 01:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 01:00    B   2   ….  ….
01/08/2020 01:00    E   2   ….  ….
01/08/2020 01:00    F   1   ….  ….
01/08/2020 02:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 02:00    E   2   ….  ….
01/08/2020 03:00    C   1   ….  ….
01/08/2020 03:00    G   1   ….  ….
 …..            ….  ….
01/08/2020 04:00    x   1   01/08/2020 04:00    01/08/2020 04:00
01/08/2020 04:00    s   1   ….  ….
            ….  ….
 …..            ….  ….
            ….  ….
01/08/2020 23:00    G   2   ….  ….
02/08/2020 00:00    G   2   ….  ….

Thanks!

Recommended answer

Start by defining 2 functions:

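import pandas as pd

# Both functions assume that Hour is a datetime column, sorted in ascending order.
def cnt(grp):
    # grp holds the rows of one site for one run of consecutive hours
    hr = grp.Hour
    return grp.assign(count=hr.size, period_start=hr.iloc[0], period_end=hr.iloc[-1])

def fn(grp):
    # A gap of more than one hour to the previous row starts a new run;
    # cumsum() turns these flags into run numbers used as the grouping key
    gr = grp.groupby((grp.Hour - grp.Hour.shift()).gt(pd.Timedelta(hours=1)).cumsum())
    return gr.apply(cnt)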

Then group and apply it:

df.groupby('Site').apply(fn).reset_index(level=[0, 1], drop=True).sort_index()

You should start reading the code from the end.

The first step is to group by Site (the first level of grouping) and apply fn to each group. For the time being, skip the rest of this instruction.

Then the fn function performs the second-level grouping. The idea is to divide the source (first-level) group into groups of rows for consecutive hours.
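The splitting into runs of consecutive hours can be tried in isolation. The following is a minimal sketch on made-up hours for a single site (the values and the names hours, gap, new_run and run_id are assumptions for illustration, not part of the answer's code):

```python
import pandas as pd

# Hypothetical hours for a single site: three consecutive, a gap, then two more.
hours = pd.Series(pd.to_datetime([
    "2020-08-01 00:00", "2020-08-01 01:00", "2020-08-01 02:00",
    "2020-08-01 04:00", "2020-08-01 05:00",
]))

# Difference to the previous row; the first row has no predecessor, so it is NaT.
gap = hours - hours.shift()

# True where a new run starts. NaT compares as False, so the first row stays in run 0.
new_run = gap.gt(pd.Timedelta(hours=1))

# The cumulative sum turns the flags into run numbers.
run_id = new_run.cumsum()
print(run_id.tolist())  # [0, 0, 0, 1, 1]
```

Rows that share a run number belong to the same block of consecutive hours, which is why the series works as a grouping key.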

The cnt function is applied to each (second-level) group. Its result is the source group with count, period_start and period_end columns added.
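To see what this step does to one group, here is a small sketch that repeats the body of cnt on a hand-built group (the frame run and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical second-level group: one site, one run of three consecutive hours.
run = pd.DataFrame({
    "Hour": pd.to_datetime(
        ["2020-08-01 00:00", "2020-08-01 01:00", "2020-08-01 02:00"]
    ),
    "Site": ["A", "A", "A"],
})

hr = run["Hour"]
# Scalars passed to assign() are broadcast to every row of the group.
summary = run.assign(
    count=hr.size,            # number of rows in the run
    period_start=hr.iloc[0],  # first hour of the run
    period_end=hr.iloc[-1],   # last hour of the run
)
print(summary)
```

Every row of the group ends up with the same three values, so the run length and its boundaries stay available on each original row.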

Now it is time to look at the (skipped) part of the first instruction. The groupby(...).apply(...) part generates the following result (for brevity I included only the result for Site == A and B):

                            Hour Site  count        period_start           period_end
Site Hour                                                                            
A    0    0  2020-08-01 00:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
          4  2020-08-01 01:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
          8  2020-08-01 02:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
     1    12 2020-08-01 04:00:00    A      2 2020-08-01 04:00:00  2020-08-01 05:00:00
          14 2020-08-01 05:00:00    A      2 2020-08-01 04:00:00  2020-08-01 05:00:00
     2    15 2020-08-01 08:00:00    A      1 2020-08-01 08:00:00  2020-08-01 08:00:00
B    0    1  2020-08-01 00:00:00    B      2 2020-08-01 00:00:00  2020-08-01 01:00:00
          5  2020-08-01 01:00:00    B      2 2020-08-01 00:00:00  2020-08-01 01:00:00

To get the final result, you need to:

  • reset_index(...) - drop the first 2 levels of the index (Site and the run number).
  • sort_index() - sort rows by the remaining index (the original row labels).
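These two calls can be checked on a small frame with the same kind of three-level index. This is only a sketch: the frame nested and its values are made up to mimic the shape of the intermediate result.

```python
import pandas as pd

# Hypothetical frame shaped like the intermediate result: the index has three
# levels, the site, the run number, and the original row label.
idx = pd.MultiIndex.from_tuples(
    [("A", 0, 0), ("A", 0, 4), ("B", 0, 1), ("B", 0, 5)],
    names=["Site", "Hour", None],
)
nested = pd.DataFrame({"count": [2, 2, 2, 2]}, index=idx)

# Drop levels 0 and 1 (site and run number); only the original row labels remain.
flat = nested.reset_index(level=[0, 1], drop=True)

# Sorting by those labels puts the rows back in the order of the source frame.
restored = flat.sort_index()
print(restored.index.tolist())  # [0, 1, 4, 5]
```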

The result is just as you expected.
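As a possible alternative (not the method used above), the same columns can probably be obtained without nested apply calls by using groupby(...).transform(...). The sketch below uses only the first twelve rows shown in the question and assumes the dates are day-first; the names out, gap, run_id and runs are introduced here for illustration.

```python
import pandas as pd

# The first twelve rows shown in the question; dates are assumed to be
# day-first, so 01/08/2020 is 1 August 2020.
df = pd.DataFrame({
    "Hour": ["01/08/2020 00:00"] * 4 + ["01/08/2020 01:00"] * 4
            + ["01/08/2020 02:00"] * 2 + ["01/08/2020 03:00"] * 2,
    "Site": list("ABCD") + list("ABEF") + list("AE") + list("CG"),
})
df["Hour"] = pd.to_datetime(df["Hour"], format="%d/%m/%Y %H:%M")

# Sort by site and time so that each row follows the previous hour of its site.
out = df.sort_values(["Site", "Hour"])

# Gap to the previous hour of the same site (NaT for the first row of a site).
gap = out["Hour"] - out.groupby("Site")["Hour"].shift()

# 1 where a new run starts, then a running total per site gives the run number.
new_run = gap.gt(pd.Timedelta(hours=1)).astype(int)
run_id = new_run.groupby(out["Site"]).cumsum().rename("run")

# transform() returns one value per row, aligned with the index of out.
runs = out.groupby(["Site", run_id])["Hour"]
out = out.assign(
    count=runs.transform("size"),
    period_start=runs.transform("min"),
    period_end=runs.transform("max"),
).sort_index()  # back to the original row order

print(out)
```

On these rows site A gets count 3 with a period from 00:00 to 02:00, while site C gets two separate runs of length 1, because its appearances at 00:00 and 03:00 are more than one hour apart. Note that period_end here is the last hour in which the site appears.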
