How to count overlapping datetime intervals in Pandas?

Views: 126  Published: 2020/10/10 19:53:53  Tags: python pandas datetime count

Problem description

I have the following DataFrame with two datetime columns:

    start               end
0   01.01.2018 00:47    01.01.2018 00:54
1   01.01.2018 00:52    01.01.2018 01:03
2   01.01.2018 00:55    01.01.2018 00:59
3   01.01.2018 00:57    01.01.2018 01:16
4   01.01.2018 01:00    01.01.2018 01:12
5   01.01.2018 01:07    01.01.2018 01:24
6   01.01.2018 01:33    01.01.2018 01:38
7   01.01.2018 01:34    01.01.2018 01:47
8   01.01.2018 01:37    01.01.2018 01:41
9   01.01.2018 01:38    01.01.2018 01:41
10  01.01.2018 01:39    01.01.2018 01:55

I would like to count how many intervals are active at the same time, that is, how many have started but not yet ended at any given moment (in other words: how many times each row overlaps with the rest of the rows).

E.g. from 00:47 to 00:52 only one interval is active, from 00:52 to 00:54 two are, from 00:54 to 00:55 only one again, and so on.

I tried stacking the columns on top of each other, sorting by date, and iterating through the whole DataFrame, adding 1 to a counter for each "start" and subtracting 1 for each "end". It works, but on my original DataFrame, which has a few million rows, the iteration takes forever. I need to find a quicker way.

My original, basic but not very good code:

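import pandas as pd
import numpy as np

df = pd.read_csv('something.csv', sep=';')

# Stack the two columns into one long frame: one row per start/stop event.
df = df.stack().to_frame()
df = df.reset_index(level=1)
df.columns = ['status', 'time']
# Sort the events by time and add an empty counter column to fill in later.
df = df.sort_values('time')
df['counter'] = np.nan
df = df.reset_index().drop('index', axis=1)

print(df.head(10))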

This gives:

    status  time                counter
0   start   01.01.2018 00:47    NaN
1   start   01.01.2018 00:52    NaN
2   stop    01.01.2018 00:54    NaN
3   start   01.01.2018 00:55    NaN
4   start   01.01.2018 00:57    NaN
5   stop    01.01.2018 00:59    NaN
6   start   01.01.2018 01:00    NaN
7   stop    01.01.2018 01:03    NaN
8   start   01.01.2018 01:07    NaN
9   stop    01.01.2018 01:12    NaN

And then:

counter = 0

for index, row in df.iterrows():

    if row['status'] == 'start':
        counter += 1
    else:
        counter -= 1
    df.loc[index, 'counter'] = counter

Final output:

        status  time                counter
    0   start   01.01.2018 00:47    1.0
    1   start   01.01.2018 00:52    2.0
    2   stop    01.01.2018 00:54    1.0
    3   start   01.01.2018 00:55    2.0
    4   start   01.01.2018 00:57    3.0
    5   stop    01.01.2018 00:59    2.0
    6   start   01.01.2018 01:00    3.0
    7   stop    01.01.2018 01:03    2.0
    8   start   01.01.2018 01:07    3.0
    9   stop    01.01.2018 01:12    2.0

Is there any way I can do this without using iterrows()?

Thanks!

Recommended answer

Use Series.cumsum with Series.map (or Series.replace):

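# Reshape to one row per event (start or end), then sort chronologically.
new_df = df.melt(var_name='status', value_name='time').sort_values('time')
# +1 for every start, -1 for every end; the running sum is the number of active intervals.
new_df['counter'] = new_df['status'].map({'start': 1, 'end': -1}).cumsum()
print(new_df)

Output:

   status                time  counter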
0   start 2018-01-01 00:47:00        1
1   start 2018-01-01 00:52:00        2
11    end 2018-01-01 00:54:00        1
2   start 2018-01-01 00:55:00        2
3   start 2018-01-01 00:57:00        3
13    end 2018-01-01 00:59:00        2
4   start 2018-01-01 01:00:00        3
12    end 2018-01-01 01:03:00        2
5   start 2018-01-01 01:07:00        3
15    end 2018-01-01 01:12:00        2
14    end 2018-01-01 01:16:00        1
16    end 2018-01-01 01:24:00        0
6   start 2018-01-01 01:33:00        1
7   start 2018-01-01 01:34:00        2
8   start 2018-01-01 01:37:00        3
9   start 2018-01-01 01:38:00        4
17    end 2018-01-01 01:38:00        3
10  start 2018-01-01 01:39:00        4
19    end 2018-01-01 01:41:00        3
20    end 2018-01-01 01:41:00        2
18    end 2018-01-01 01:47:00        1
21    end 2018-01-01 01:55:00        0
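
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same melt / map / cumsum idea. It rebuilds the sample data from the question instead of reading a CSV. Two things in it are my own assumptions, not part of the answer: the format string "%d.%m.%Y %H:%M" used to parse the day-first timestamps, and kind="mergesort", chosen only because it is a stable sort, so a start and an end that share a timestamp (01:38 here) stay in start-then-end order. The name events is just an illustrative variable.

```python
import pandas as pd

# Sample intervals copied from the question (day-first strings).
rows = [
    ("01.01.2018 00:47", "01.01.2018 00:54"),
    ("01.01.2018 00:52", "01.01.2018 01:03"),
    ("01.01.2018 00:55", "01.01.2018 00:59"),
    ("01.01.2018 00:57", "01.01.2018 01:16"),
    ("01.01.2018 01:00", "01.01.2018 01:12"),
    ("01.01.2018 01:07", "01.01.2018 01:24"),
    ("01.01.2018 01:33", "01.01.2018 01:38"),
    ("01.01.2018 01:34", "01.01.2018 01:47"),
    ("01.01.2018 01:37", "01.01.2018 01:41"),
    ("01.01.2018 01:38", "01.01.2018 01:41"),
    ("01.01.2018 01:39", "01.01.2018 01:55"),
]
df = pd.DataFrame(rows, columns=["start", "end"])

# Parse to real datetimes so sorting is chronological, not lexicographic.
# The format string is an assumption based on how the sample data looks.
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col], format="%d.%m.%Y %H:%M")

# One row per event; the 'status' column holds 'start' or 'end'.
events = df.melt(var_name="status", value_name="time")

# Stable sort: events with equal timestamps keep their melt order,
# which puts starts before ends.
events = events.sort_values("time", kind="mergesort")

# +1 per start, -1 per end; the running total is the number of active intervals.
events["counter"] = events["status"].map({"start": 1, "end": -1}).cumsum()

print(events)
print("peak concurrency:", events["counter"].max())
```

If an interval that ends at 01:38 should not count as overlapping one that starts at 01:38, the ends would have to sort before the starts on ties, and the peak for this data would then be lower at that moment.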

We can also use numpy.cumsum:

new_df['counter'] = np.where(new_df['status'].eq('start'),1,-1).cumsum()
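
The question also phrases the goal as "how many times each row overlaps with the rest of the rows", which the running counter does not give directly. The following is a possible sketch of a per-row count using numpy.searchsorted. It is not from the answer, and it assumes half-open intervals (an interval ending at 01:38 does not overlap one starting at 01:38) and that every row has start < end. The column name overlaps and the parsing format string are likewise assumptions of mine.

```python
import numpy as np
import pandas as pd

# Sample intervals copied from the question (day-first strings).
rows = [
    ("01.01.2018 00:47", "01.01.2018 00:54"),
    ("01.01.2018 00:52", "01.01.2018 01:03"),
    ("01.01.2018 00:55", "01.01.2018 00:59"),
    ("01.01.2018 00:57", "01.01.2018 01:16"),
    ("01.01.2018 01:00", "01.01.2018 01:12"),
    ("01.01.2018 01:07", "01.01.2018 01:24"),
    ("01.01.2018 01:33", "01.01.2018 01:38"),
    ("01.01.2018 01:34", "01.01.2018 01:47"),
    ("01.01.2018 01:37", "01.01.2018 01:41"),
    ("01.01.2018 01:38", "01.01.2018 01:41"),
    ("01.01.2018 01:39", "01.01.2018 01:55"),
]
df = pd.DataFrame(rows, columns=["start", "end"])
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col], format="%d.%m.%Y %H:%M")

starts = df["start"].to_numpy()
ends = df["end"].to_numpy()
sorted_starts = np.sort(starts)
sorted_ends = np.sort(ends)

# For each row: how many intervals begin strictly before this row ends ...
began_before_end = np.searchsorted(sorted_starts, ends, side="left")
# ... and how many of them were already over when this row starts.
ended_before_start = np.searchsorted(sorted_ends, starts, side="right")

# The difference counts every overlapping interval including the row itself,
# so subtract 1 to leave only the other rows.
df["overlaps"] = began_before_end - ended_before_start - 1

print(df)
```

This avoids any Python-level loop: two sorts and two binary-search passes, so it should scale to millions of rows far better than comparing every pair of intervals.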
