Count of a value in consecutive timestamps in pandas
Problem description
Hour Site
01/08/2020 00:00 A
01/08/2020 00:00 B
01/08/2020 00:00 C
01/08/2020 00:00 D
01/08/2020 01:00 A
01/08/2020 01:00 B
01/08/2020 01:00 E
01/08/2020 01:00 F
01/08/2020 02:00 A
01/08/2020 02:00 E
01/08/2020 03:00 C
01/08/2020 03:00 G
…..
01/08/2020 04:00 x
01/08/2020 04:00 s
…..
01/08/2020 23:00 G
02/08/2020 00:00 G
I have a dataframe like the one above. I want to count how many consecutive hours each site appears in, along with the start and end timestamps of each run. Each hour contains multiple sites. For example, site A appears in 3 consecutive timestamps, then again in a single timestamp. I want output like the one below, or in a more efficient format.
Hour Site count period_start Period_end
01/08/2020 00:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 00:00 B 2 01/08/2020 00:00 01/08/2020 01:00
01/08/2020 00:00 C 1 ….. …
01/08/2020 00:00 D 1 …. ….
01/08/2020 01:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 01:00 B 2 …. ….
01/08/2020 01:00 E 2 …. ….
01/08/2020 01:00 F 1 …. ….
01/08/2020 02:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 02:00 E 2 …. ….
01/08/2020 03:00 C 1 …. ….
01/08/2020 03:00 G 1 …. ….
….. …. ….
01/08/2020 04:00 x 1 01/08/2020 04:00 01/08/2020 04:00
01/08/2020 04:00 s 1 …. ….
…. ….
….. …. ….
…. ….
01/08/2020 23:00 G 2 …. ….
02/08/2020 00:00 G 2 …. ….
Thanks!
Recommended answer
Start by defining 2 functions:
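import pandas as pd

def cnt(grp):
    # Tag every row of one run with the run's size and its first/last hour.
    hr = grp.Hour
    return grp.assign(count=hr.size, period_start=hr.iloc[0], period_end=hr.iloc[-1])

def fn(grp):
    # A gap of more than one hour since the previous row starts a new run.
    gr = grp.groupby((grp.Hour - grp.Hour.shift()).gt(pd.Timedelta(hours=1)).cumsum())
    return gr.apply(cnt)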
Then group and apply it:
df.groupby('Site').apply(fn).reset_index(level=[0, 1], drop=True).sort_index()
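The call above subtracts timestamps, so it presumably needs Hour to hold real datetime values rather than strings, with rows in chronological order. A minimal preparation sketch, assuming the day-first format seen in the question; the sample frame here is hypothetical and only mirrors a few rows of the original data:

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the question's data.
df = pd.DataFrame({
    "Hour": ["01/08/2020 00:00", "01/08/2020 00:00", "01/08/2020 01:00",
             "01/08/2020 01:00", "01/08/2020 02:00", "01/08/2020 03:00"],
    "Site": ["A", "B", "A", "B", "A", "C"],
})

# Parse the day-first strings so that subtracting two hours yields a Timedelta.
df["Hour"] = pd.to_datetime(df["Hour"], format="%d/%m/%Y %H:%M")

# Keep rows in chronological order; a stable sort preserves the site order within an hour.
df = df.sort_values("Hour", kind="stable")
```

An explicit format avoids 01/08/2020 being read as 8 January.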
Read the code starting from the end.
The first step is to group by Site (the first level of grouping) and apply fn to each group. For now, skip the rest of that instruction.
The fn function then performs the second-level grouping. The idea is to split the source (first-level) group into groups of rows covering consecutive hours: a new group starts whenever the gap from the previous row exceeds one hour.
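The grouping key inside fn can be looked at in isolation. The sketch below uses a hypothetical list of hours for a single site and builds the key step by step; the variable names are illustrative and do not appear in the answer:

```python
import pandas as pd

# Hypothetical hours for one site: a run of three, a run of two, and a lone hour.
hours = pd.Series(pd.to_datetime([
    "2020-08-01 00:00", "2020-08-01 01:00", "2020-08-01 02:00",
    "2020-08-01 04:00", "2020-08-01 05:00", "2020-08-01 08:00",
]))

# Difference to the previous row; the first row has no predecessor, so it is NaT.
gap = hours - hours.shift()

# True where a new run starts. NaT compares as False, so the first row stays in run 0.
new_run = gap.gt(pd.Timedelta(hours=1))

# The running total of the True flags gives one integer label per run.
run_id = new_run.cumsum()
print(run_id.tolist())  # [0, 0, 0, 1, 1, 2]
```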
cnt is applied to each second-level group. Its result is the source group with count, period_start and period_end columns added.
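To see what cnt adds, here is the same assign call applied to a single run. The two-row frame is hypothetical; assign broadcasts each scalar to every row and leaves the index untouched:

```python
import pandas as pd

# Hypothetical run of two consecutive hours for site A, keeping its original row numbers.
run = pd.DataFrame(
    {
        "Hour": pd.to_datetime(["2020-08-01 04:00", "2020-08-01 05:00"]),
        "Site": ["A", "A"],
    },
    index=[12, 14],
)

hr = run["Hour"]
# Every row of the run receives the same three values.
tagged = run.assign(count=hr.size, period_start=hr.iloc[0], period_end=hr.iloc[-1])
print(tagged["count"].tolist())  # [2, 2]
```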
Now it is time to look at the skipped part of the first instruction. The groupby(...).apply(...) part generates the following result (for brevity I included only the rows for Site == A and B):
                             Hour  Site  count         period_start           period_end
Site Hour
A    0    0   2020-08-01 00:00:00     A      3  2020-08-01 00:00:00  2020-08-01 02:00:00
          4   2020-08-01 01:00:00     A      3  2020-08-01 00:00:00  2020-08-01 02:00:00
          8   2020-08-01 02:00:00     A      3  2020-08-01 00:00:00  2020-08-01 02:00:00
     1    12  2020-08-01 04:00:00     A      2  2020-08-01 04:00:00  2020-08-01 05:00:00
          14  2020-08-01 05:00:00     A      2  2020-08-01 04:00:00  2020-08-01 05:00:00
     2    15  2020-08-01 08:00:00     A      1  2020-08-01 08:00:00  2020-08-01 08:00:00
B    0    1   2020-08-01 00:00:00     B      2  2020-08-01 00:00:00  2020-08-01 01:00:00
          5   2020-08-01 01:00:00     B      2  2020-08-01 00:00:00  2020-08-01 01:00:00
To get the final result, two more steps are needed:
- reset_index(...) - drop the first 2 levels of the index.
- sort_index() - sort rows by index.
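The effect of these two calls can be checked on a small frame with a three-level index. This is only an illustration with a hypothetical index of (Site, run label, original row number), not the answer's actual intermediate result:

```python
import pandas as pd

# Hypothetical three-level index: (Site, run label, original row number).
idx = pd.MultiIndex.from_tuples(
    [("A", 0, 0), ("A", 0, 4), ("B", 0, 1), ("B", 0, 5)],
    names=["Site", "Hour", None],
)
nested = pd.DataFrame({"count": [2, 2, 2, 2]}, index=idx)

# Drop the two grouping levels, keeping only the original row numbers.
flat = nested.reset_index(level=[0, 1], drop=True)

# Restore the original row order.
flat = flat.sort_index()
print(flat.index.tolist())  # [0, 1, 4, 5]
```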
The result is just as you expected.
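As a possible alternative, the same columns can be computed without nested apply calls by using groupby with transform, which keeps the original index and so needs no index cleanup afterwards. This is a different technique from the one in the answer, sketched on a hypothetical sample and assuming Hour is already datetime and in chronological order:

```python
import pandas as pd

# Hypothetical sample: site A has a run of three hours and a lone hour, site B a run of two.
df = pd.DataFrame({
    "Hour": pd.to_datetime([
        "2020-08-01 00:00", "2020-08-01 00:00",
        "2020-08-01 01:00", "2020-08-01 01:00",
        "2020-08-01 02:00", "2020-08-01 04:00",
    ]),
    "Site": ["A", "B", "A", "B", "A", "A"],
})

# Gap to the previous row of the same site (NaT for each site's first row).
gap = df["Hour"] - df.groupby("Site")["Hour"].shift()

# 1 where a new run starts, then a per-site running total as the run label.
new_run = gap.gt(pd.Timedelta(hours=1)).astype(int)
run_id = new_run.groupby(df["Site"]).cumsum().rename("run")

# Labels are only unique within a site, so group by both keys.
runs = df.groupby(["Site", run_id])["Hour"]
out = df.assign(
    count=runs.transform("count"),      # rows per run (Hour has no missing values here)
    period_start=runs.transform("min"),
    period_end=runs.transform("max"),
)
print(out["count"].tolist())  # [3, 2, 3, 2, 3, 1]
```

Because transform returns values aligned to the input rows, the rows of out stay in their original order.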