有效地从2个数据框中找到日期时间范围的重叠 [英] Efficiently find overlap of date-time ranges from 2 dataframes
问题描述
关于查找日期或时间范围的重叠部分存在一些问题(例如).我用这些来解决我的问题,但最终我得到了一个极其缓慢(且并非完全优雅)的解决方案.如果有人对如何更快地做到这一点(并且更加优雅)有想法,我将不胜感激:
There are some questions out there regarding finding the overlap in date or time ranges (for example). I've used these to solve my problem, but I've ended up with an extremely slow (and not-at-all elegant) solution to my problem. I would really appreciate it if someone has an idea of how to make this faster (and more elegant):
问题:
我有2个数据框df1
和df2
,每个数据框都有2列,分别代表开始时间和结束时间:
I've got 2 dataframes, df1
and df2
, each with 2 columns that represent a start time and an end time:
>>> df1
datetime_start datetime_end
0 2016-09-11 06:00:00 2016-09-11 06:30:00
1 2016-09-11 07:00:00 2016-09-11 07:30:00
2 2016-09-11 07:30:00 2016-09-11 08:00:00
3 2016-09-11 08:00:00 2016-09-11 08:30:00
4 2016-09-11 08:30:00 2016-09-11 09:00:00
5 2016-09-11 09:00:00 2016-09-11 09:30:00
6 2016-09-11 09:30:00 2016-09-11 10:00:00
7 2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00
>>> df2
datetime_start datetime_end catg
4 2016-09-11 08:48:33 2016-09-11 09:41:53 a
6 2016-09-11 09:54:25 2016-09-11 10:00:50 a
8 2016-09-11 10:01:47 2016-09-11 10:04:55 b
10 2016-09-11 10:08:00 2016-09-11 10:08:11 b
12 2016-09-11 10:30:28 2016-09-11 10:30:28 b
14 2016-09-11 10:38:18 2016-09-11 10:38:18 a
18 2016-09-11 13:44:05 2016-09-11 13:44:05 a
20 2016-09-11 13:46:52 2016-09-11 14:11:41 d
23 2016-09-11 14:22:17 2016-09-11 14:33:40 b
25 2016-09-11 15:00:12 2016-09-11 15:02:55 b
27 2016-09-11 15:04:19 2016-09-11 15:06:36 b
29 2016-09-11 15:08:43 2016-09-11 15:31:29 d
31 2016-09-11 15:38:04 2016-09-11 16:09:24 a
33 2016-09-11 16:18:40 2016-09-11 16:44:32 b
35 2016-09-11 16:45:59 2016-09-11 16:59:01 b
37 2016-09-11 17:08:31 2016-09-11 17:12:23 b
39 2016-09-11 17:16:13 2016-09-11 17:16:33 c
41 2016-09-11 17:17:23 2016-09-11 17:20:00 b
45 2016-09-13 12:27:59 2016-09-13 12:34:21 a
47 2016-09-13 12:38:39 2016-09-13 12:38:45 a
我想要的是找到df2
中的范围与df1
中的范围重叠的位置,重叠的时间(以秒为单位)以及df2.catg
的值是多少.我想将重叠部分的长度插入df1
的列中(该列将以其表示的catg
命名).
What I want is to find where the ranges in df2
have overlap with the ranges in df1
, how long that overlap is (in seconds), and what value of df2.catg
that is. I want the length of that overlap inserted into a column in df1
(that column will be named for the catg
it represents).
所需的输出:
>>> df1
datetime_start datetime_end a b d c
0 2016-09-11 06:00:00 2016-09-11 06:30:00 0.0 0.0 0.0 0.0
1 2016-09-11 07:00:00 2016-09-11 07:30:00 0.0 0.0 0.0 0.0
2 2016-09-11 07:30:00 2016-09-11 08:00:00 0.0 0.0 0.0 0.0
3 2016-09-11 08:00:00 2016-09-11 08:30:00 0.0 0.0 0.0 0.0
4 2016-09-11 08:30:00 2016-09-11 09:00:00 687.0 0.0 0.0 0.0
5 2016-09-11 09:00:00 2016-09-11 09:30:00 1800.0 0.0 0.0 0.0
6 2016-09-11 09:30:00 2016-09-11 10:00:00 1048.0 0.0 0.0 0.0
7 2016-09-11 10:30:00 2016-09-11 11:00:00 0.0 0.0 0.0 0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00 0.0 463.0 701.0 0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00 0.0 220.0 0.0 0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00 0.0 300.0 1277.0 0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00 1316.0 0.0 89.0 0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00 564.0 680.0 0.0 0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00 0.0 1654.0 0.0 0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00 0.0 389.0 0.0 20.0
执行此操作的荒谬方式:
基于这个美丽的答案,我已经达到了我想要使用以下难以理解的代码实现的目标:
Based on this beautiful answer, I've achieved the goals I want using the following hard to follow code:
from collections import namedtuple
Range = namedtuple('Range', ['start', 'end'])
def overlap(row1, row2):
r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
latest_start = max(r1.start, r2.start)
earliest_end = min(r1.end, r2.end)
delta = (earliest_end - latest_start).total_seconds()
overlap = max(0, delta)
return overlap
for cat in df2.catg.unique().tolist():
df1[cat] = 0
for idx1, row1 in df1.iterrows():
for idx2, row2 in df2.iterrows():
if overlap(row1, row2) > 0:
df1.loc[idx1, row2.catg] += overlap(row1, row2)
这是可行的,但是对于较大的数据帧来说太慢了,因为它基本上是无法使用的.如果有人有什么想法可以加快速度,我将很高兴您的投入.
This works, but is soooo slow on larger dataframes that it's basically un-useable. If anyone has any ideas to speed this up, I'd love your input.
预先感谢,如果有不清楚的地方,请通知我!
Thanks in advance, and let me know if something is unclear!
数据框设置:
import pandas as pd
from pandas import Timestamp
d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}
d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
推荐答案
基于timeit
测试,每个测试执行100次,问题中的namedtuple
方法在我的计算机上平均15.7314
秒,而平均这种方法的时间为1.4794
秒:
Based on timeit
tests, with 100 executions each, the namedtuple
approach in the question averaged 15.7314
seconds on my machine, vs. an average of 1.4794
seconds with this approach:
# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values
# create a numpy array with one timestamp for each second
# in which an event occurred
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
np.concatenate(map(np.arange, duration)) * pd.Timedelta('1S')
df1.merge(pd.DataFrame({'datetime_start':seconds_range,
'catg':np.repeat(df2.catg, duration)}). \
groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
size(). \
unstack(level=0). \
reset_index(),
how="left")
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
这篇关于有效地从2个数据框中找到日期时间范围的重叠的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!