How to count overlapping datetime intervals in Pandas?
Question
I have the following DataFrame with two datetime columns:
start end
0 01.01.2018 00:47 01.01.2018 00:54
1 01.01.2018 00:52 01.01.2018 01:03
2 01.01.2018 00:55 01.01.2018 00:59
3 01.01.2018 00:57 01.01.2018 01:16
4 01.01.2018 01:00 01.01.2018 01:12
5 01.01.2018 01:07 01.01.2018 01:24
6 01.01.2018 01:33 01.01.2018 01:38
7 01.01.2018 01:34 01.01.2018 01:47
8 01.01.2018 01:37 01.01.2018 01:41
9 01.01.2018 01:38 01.01.2018 01:41
10 01.01.2018 01:39 01.01.2018 01:55
I would like to count how many intervals are active at the same time, that is, how many have started but not yet ended at a given moment (in other words: how many times each row overlaps with the rest of the rows).
E.g. from 00:47 to 00:52 only one is active, from 00:52 to 00:54 two, from 00:54 to 00:55 only one again, and so on.
I tried stacking the columns onto each other, sorting by date, and iterating through the whole DataFrame, adding 1 to a counter for each "start" and subtracting 1 for each "end". It works, but on my original DataFrame, which has a few million rows, the iteration takes forever. I need to find a quicker way.
My original, basic but not very good, code:
import pandas as pd
import numpy as np
df = pd.read_csv('something.csv', sep=';')
df = df.stack().to_frame()
df = df.reset_index(level=1)
df.columns = ['status', 'time']
df = df.sort_values('time')
df['counter'] = np.nan
df = df.reset_index().drop('index', axis=1)
print(df.head(10))
which gives:
status time counter
0 start 01.01.2018 00:47 NaN
1 start 01.01.2018 00:52 NaN
2 stop 01.01.2018 00:54 NaN
3 start 01.01.2018 00:55 NaN
4 start 01.01.2018 00:57 NaN
5 stop 01.01.2018 00:59 NaN
6 start 01.01.2018 01:00 NaN
7 stop 01.01.2018 01:03 NaN
8 start 01.01.2018 01:07 NaN
9 stop 01.01.2018 01:12 NaN
And:
counter = 0
for index, row in df.iterrows():
    if row['status'] == 'start':
        counter += 1
    else:
        counter -= 1
    df.loc[index, 'counter'] = counter
Final output:
status time counter
0 start 01.01.2018 00:47 1.0
1 start 01.01.2018 00:52 2.0
2 stop 01.01.2018 00:54 1.0
3 start 01.01.2018 00:55 2.0
4 start 01.01.2018 00:57 3.0
5 stop 01.01.2018 00:59 2.0
6 start 01.01.2018 01:00 3.0
7 stop 01.01.2018 01:03 2.0
8 start 01.01.2018 01:07 3.0
9 stop 01.01.2018 01:12 2.0
Is there any way I can do this without using iterrows()?
Thanks!
Recommended answer
Use Series.cumsum with Series.map (or Series.replace):
new_df = df.melt(var_name = 'status',value_name = 'time').sort_values('time')
new_df['counter'] = new_df['status'].map({'start':1,'end':-1}).cumsum()
print(new_df)
status time counter
0 start 2018-01-01 00:47:00 1
1 start 2018-01-01 00:52:00 2
11 end 2018-01-01 00:54:00 1
2 start 2018-01-01 00:55:00 2
3 start 2018-01-01 00:57:00 3
13 end 2018-01-01 00:59:00 2
4 start 2018-01-01 01:00:00 3
12 end 2018-01-01 01:03:00 2
5 start 2018-01-01 01:07:00 3
15 end 2018-01-01 01:12:00 2
14 end 2018-01-01 01:16:00 1
16 end 2018-01-01 01:24:00 0
6 start 2018-01-01 01:33:00 1
7 start 2018-01-01 01:34:00 2
8 start 2018-01-01 01:37:00 3
9 start 2018-01-01 01:38:00 4
17 end 2018-01-01 01:38:00 3
10 start 2018-01-01 01:39:00 4
19 end 2018-01-01 01:41:00 3
20 end 2018-01-01 01:41:00 2
18 end 2018-01-01 01:47:00 1
21 end 2018-01-01 01:55:00 0
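For reference, here is a minimal self-contained sketch of the same approach that can be run without the CSV file. It uses only the first six rows of the question's sample data. Two details are assumptions not stated in the answer: the date strings are parsed with an explicit day-first format so that sorting is chronological, and a stable sort (mergesort) is used so that rows with equal timestamps keep the melted order, with starts before ends. The name `events` is illustrative only.

```python
import pandas as pd

# First six intervals from the question, as day-first date strings.
df = pd.DataFrame({
    "start": ["01.01.2018 00:47", "01.01.2018 00:52", "01.01.2018 00:55",
              "01.01.2018 00:57", "01.01.2018 01:00", "01.01.2018 01:07"],
    "end":   ["01.01.2018 00:54", "01.01.2018 01:03", "01.01.2018 00:59",
              "01.01.2018 01:16", "01.01.2018 01:12", "01.01.2018 01:24"],
})

# Assumption: parse the strings so that sorting is chronological,
# not lexicographic on the raw text.
for col in ("start", "end"):
    df[col] = pd.to_datetime(df[col], format="%d.%m.%Y %H:%M")

# One row per event: the column name becomes 'status', the value becomes 'time'.
events = df.melt(var_name="status", value_name="time")

# Assumption: mergesort is stable, so ties keep the melted order
# (all starts come before all ends in the melted frame).
events = events.sort_values("time", kind="mergesort")

# +1 for every start, -1 for every end; the running sum is the active count.
events["counter"] = events["status"].map({"start": 1, "end": -1}).cumsum()
print(events)
```

With starts sorted before ends on equal timestamps, an interval that begins at the exact moment another one ends is briefly counted as overlapping it. If that is not wanted, the tie order would need to be reversed.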
We can also use NumPy's cumsum:
new_df['counter'] = np.where(new_df['status'].eq('start'),1,-1).cumsum()
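The running counter gives the number of active intervals at each event. If what is needed instead is the number of other rows that each row overlaps with (the question's "in other words" phrasing), one possible sketch uses numpy.searchsorted on the sorted starts and ends. This is not part of the answer above. It assumes every interval has start < end and treats intervals as half-open, so an interval ending at exactly the moment another starts does not count as overlapping. The column name `overlaps` is illustrative only.

```python
import numpy as np
import pandas as pd

# First six intervals from the question, already parsed as datetimes.
df = pd.DataFrame({
    "start": pd.to_datetime(["2018-01-01 00:47", "2018-01-01 00:52",
                             "2018-01-01 00:55", "2018-01-01 00:57",
                             "2018-01-01 01:00", "2018-01-01 01:07"]),
    "end":   pd.to_datetime(["2018-01-01 00:54", "2018-01-01 01:03",
                             "2018-01-01 00:59", "2018-01-01 01:16",
                             "2018-01-01 01:12", "2018-01-01 01:24"]),
})

starts = df["start"].to_numpy()
ends = df["end"].to_numpy()

# For each row: how many intervals started strictly before this row's end.
started_before_end = np.searchsorted(np.sort(starts), ends, side="left")

# For each row: how many intervals had already ended when this row started.
ended_by_start = np.searchsorted(np.sort(ends), starts, side="right")

# The difference counts every interval overlapping the row, including the
# row itself, so subtract 1 (assumes start < end for every interval).
df["overlaps"] = started_before_end - ended_by_start - 1
print(df)
```

Both lookups are binary searches over sorted arrays, so the whole computation avoids a Python-level loop over rows.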