Collapse and merge overlapping time intervals


Problem description

I am developing a tidyverse-based data workflow and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals; it can be reproduced like this:

library(tidyverse)
library(lubridate)

my_time_intervals <- tribble(
    ~id, ~group, ~start_time, ~end_time,
    1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
    2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
    3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
    4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
    5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
    6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
    7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
    8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

Here's a tibble view of the same data frame:

> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Some notes about my_time_intervals:

  1. The data is divided into three groups via the group variable.

  2. The id variable is just a unique ID for each row in the data frame.

  3. The start and end of each time interval are stored in start_time and end_time as date-times (<dttm>) created with lubridate.

  4. Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.

  5. More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.

What I want is to collapse, within each group, any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:

# A tibble: 5 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Notice that time intervals that overlap across different groups are not merged. Also, I don't care what happens to the id column at this point.

I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.

How do I achieve this? Thank you very much.

Recommended answer

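my_time_intervals %>% group_by(group) %>% arrange(start_time) %>% 
                      # indx goes up by 1 whenever the next start_time is later than
                      # every end_time seen so far in the group, i.e. a gap begins
                      mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
                      # each (group, indx) pair is one run of overlapping intervals
                      group_by(group, indx) %>%
                      summarise(start_time = min(start_time), end_time = max(end_time)) %>%
                      select(-indx)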


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset with more overlapping times within each group, so that the solution gets more exposure and is hopefully easier to grasp:

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I added group to arrange() so that rows from the same group are shown together; because we already have group_by(group), this is not actually needed for the computation.

my_time_intervals %>% group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


  # # A tibble: 13 x 5
  # # Groups:   group [3]
  # id group start_time          end_time             indx
  # <int> <int> <dttm>              <dttm>              <dbl>
  # 1     1      1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
  # 2     7      1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
  # 3     8      1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
  # 4     2      1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
  # 5     3      1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
  # 6     5      1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
  # 7     4      1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
  # 8     6      1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
  # 9     9      2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
  # 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
  # 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
  # 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
  # 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in group 1 we have 3 distinct periods with overlapping data points and one data point that overlaps with nothing else in that group. The indx column divides those data points into 4 subgroups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(group, indx), each set of overlapping intervals ends up together, and we take the earliest start time and the latest end time to produce the desired output.

To make the solution less prone to errors, I changed first() and last() to min() and max(). This covers the case where one data point starts sooner but ends later than all the others in the same subgroup (group and indx), as in rows 6 and 7 of the output above (ids 5 and 4).
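As a small illustration of why min() and max() are the safer choice, the base R sketch below takes the two rows with id 5 and id 4 from the dataset above, sorted by start time, and compares "last end time" with "maximum end time". The variable names and the UTC time zone are assumptions made for the example only.

```r
# Two overlapping intervals from the same (group, indx) run, sorted by start time.
# These are the rows with id 5 and id 4 from the dataset above.
start_time <- as.POSIXct(c("2018-07-15 01:53:20", "2018-07-15 02:53:20"), tz = "UTC")
end_time   <- as.POSIXct(c("2018-07-19 18:09:01", "2018-07-16 18:09:01"), tz = "UTC")

# Taking the last end_time picks the end of the nested interval, which is too early
last_end <- end_time[length(end_time)]

# Taking the maximum end_time covers the whole run
max_end <- max(end_time)

print(last_end)
print(max_end)
```

Here the second interval lies entirely inside the first, so the last row's end time falls short of the true end of the merged interval, while the maximum does not.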

So...

my_time_intervals %>% group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each run of overlapping intervals to get the period (start and end) covered by each of them.

Beyond this point, you need to read about cumsum and cummax and look at the output of these two functions for this specific problem to understand why the comparison I made ends up giving a unique identifier for each run of overlapping intervals.
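To make that comparison concrete, here is a minimal base R sketch on a toy example. The vectors start and end are made-up numbers standing in for the timestamps of a single group, already sorted by start, and lead_start imitates what dplyr's lead() would return. It is only meant to show the intermediate values, not to replace the pipeline above.

```r
# Toy intervals for a single group, already sorted by start (hypothetical values)
start <- c(1, 3, 4, 10, 12, 20)
end   <- c(5, 4, 8, 13, 15, 22)
n <- length(start)

# Latest end seen so far, row by row
running_end <- cummax(end)

# Start of the next row; NA for the last row (what dplyr::lead() would give)
lead_start <- c(start[-1], NA)

# TRUE where the next interval starts after everything seen so far has ended
gap_follows <- lead_start > running_end

# Count the gaps, drop the last (NA) element, and prepend 0 for the first row
indx <- c(0, cumsum(gap_follows)[-n])

print(data.frame(start, end, running_end, lead_start, gap_follows, indx))

# Collapse each run of overlapping intervals to its earliest start and latest end
merged <- data.frame(
  start = as.vector(tapply(start, indx, min)),
  end   = as.vector(tapply(end, indx, max))
)
print(merged)
```

Because the comparison is a strict >, an interval that starts exactly when the running maximum ends is treated as part of the same run rather than as the start of a new one.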

Hope this helps; it is the best explanation I can give.
