In R, how do I split & aggregate timestamp interval data with IDs into regular slots?

Question

I'm working on the next step of my data aggregation, following a previous question. There, Jon Spring pointed me to a solution for counting the number of active events in a given time interval.

As the next step, I'd like to aggregate this data and obtain the number of observations with distinct IDs that were active at any point during each fixed time interval.

Starting with a toy dataset of seven events with five IDs:

library(tidyverse); library(lubridate)

df1 <- tibble::tibble(
  id = c("a", "b", "c", "c", "c", "d", "e"),
  start = c(ymd_hms("2018-12-10 13:01:00"),
                 ymd_hms("2018-12-10 13:07:00"),
                 ymd_hms("2018-12-10 14:45:00"),
                 ymd_hms("2018-12-10 14:48:00"),
                 ymd_hms("2018-12-10 14:52:00"),
                 ymd_hms("2018-12-10 14:45:00"),
                 ymd_hms("2018-12-10 14:45:00")),
  end = c(ymd_hms("2018-12-10 13:05:00"),
               ymd_hms("2018-12-10 13:17:00"),
               ymd_hms("2018-12-10 14:46:00"),
               ymd_hms("2018-12-10 14:50:00"),
               ymd_hms("2018-12-10 15:01:00"),
               ymd_hms("2018-12-10 14:51:00"),
               ymd_hms("2018-12-10 15:59:00")))

I could brute-force this by looping over each row of the data frame and expanding each record into the fixed intervals that cover the period from start to end, here using 15 minutes:

for (i in 1:nrow(df1)) {

  right <- df1 %>% 
    slice(i) %>% 
    mutate(start_floor = floor_date(start, "15 mins"))

  left <- tibble::tibble(
    timestamp = seq.POSIXt(right$start_floor, 
                           right$end, 
                           by  = "15 mins"),
    id = right$id)

  if (i == 1){
    result <- left
  }
  else {
    result <- bind_rows(result, left) %>% 
      distinct()
  }
}

Then it's a matter of simple aggregation to obtain the final result:

result_agg <- result %>% 
  group_by(timestamp) %>% 
  summarise(users_mac = n())

That gives the desired result, but it will probably not scale well to the dataset I need to use it with (about 7 million records at the moment, and growing).

Is there a better solution to this problem?

Answer

A tidy solution could be achieved using the tsibble package.

library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
#> Registered S3 method overwritten by 'rvest':
#>   method            from
#>   read_xml.response xml2
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(tsibble, warn.conflicts = FALSE)

df1 <- tibble(
  id = c("a", "b", "c", "c", "c", "d", "e"),
  start = c(ymd_hms("2018-12-10 13:01:00"),
            ymd_hms("2018-12-10 13:07:00"),
            ymd_hms("2018-12-10 14:45:00"),
            ymd_hms("2018-12-10 14:48:00"),
            ymd_hms("2018-12-10 14:52:00"),
            ymd_hms("2018-12-10 14:45:00"),
            ymd_hms("2018-12-10 14:45:00")),
  end = c(ymd_hms("2018-12-10 13:05:00"),
          ymd_hms("2018-12-10 13:17:00"),
          ymd_hms("2018-12-10 14:46:00"),
          ymd_hms("2018-12-10 14:50:00"),
          ymd_hms("2018-12-10 15:01:00"),
          ymd_hms("2018-12-10 14:51:00"),
          ymd_hms("2018-12-10 15:59:00")))

df1 %>% 
  mutate(
    start = floor_date(start, "15 mins"),
    end = floor_date(end, "15 mins")
  ) %>% 
  gather("label", "index", start:end) %>% 
  distinct(id, index) %>%
  mutate(date = as_date(index)) %>% 
  as_tsibble(key = c(id, date), index = index) %>%
  fill_gaps() %>% 
  index_by(index) %>% 
  summarise(users_mac = n())
#> # A tsibble: 7 x 2 [15m] <UTC>
#>   index               users_mac
#>   <dttm>                  <int>
#> 1 2018-12-10 13:00:00         2
#> 2 2018-12-10 13:15:00         1
#> 3 2018-12-10 14:45:00         3
#> 4 2018-12-10 15:00:00         2
#> 5 2018-12-10 15:15:00         1
#> 6 2018-12-10 15:30:00         1
#> 7 2018-12-10 15:45:00         1

Created on 2019-05-17 by the reprex package (v0.2.1)
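
For comparison, the same expansion can be sketched without tsibble by computing how many 15-minute slots each event touches and repeating its row that many times. This is only an illustrative alternative, not part of the original answer: the names slot_secs, slot_start, slot_end, n_slots and k are made up for the example, and it assumes every event has end >= start. One reason to consider it: as far as I understand, fill_gaps() fills every slot between the first and last observation of a key, so two widely separated events with the same id on the same date would also be counted in the idle slots between them. Expanding each event separately avoids that.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Same toy data as in the question
df1 <- tibble(
  id = c("a", "b", "c", "c", "c", "d", "e"),
  start = ymd_hms(c("2018-12-10 13:01:00", "2018-12-10 13:07:00",
                    "2018-12-10 14:45:00", "2018-12-10 14:48:00",
                    "2018-12-10 14:52:00", "2018-12-10 14:45:00",
                    "2018-12-10 14:45:00")),
  end = ymd_hms(c("2018-12-10 13:05:00", "2018-12-10 13:17:00",
                  "2018-12-10 14:46:00", "2018-12-10 14:50:00",
                  "2018-12-10 15:01:00", "2018-12-10 14:51:00",
                  "2018-12-10 15:59:00")))

slot_secs <- 15 * 60  # slot width in seconds (hypothetical helper name)

result_agg <- df1 %>%
  mutate(
    # First and last slot touched by each event
    slot_start = floor_date(start, "15 mins"),
    slot_end   = floor_date(end, "15 mins"),
    # Number of slots from slot_start to slot_end, inclusive
    n_slots = as.integer(round(
      as.numeric(difftime(slot_end, slot_start, units = "secs")) / slot_secs
    )) + 1L
  ) %>%
  # Repeat each row once per slot; k is the 1-based position within the event
  uncount(n_slots, .id = "k") %>%
  mutate(timestamp = slot_start + (k - 1) * slot_secs) %>%
  # Count each id at most once per slot
  distinct(id, timestamp) %>%
  count(timestamp, name = "users_mac")

print(result_agg)
```

On the toy data this should give the same seven slots and counts as the tsibble output above. Because the expansion is vectorised rather than row-by-row, it is likely to cope better with millions of rows than the loop in the question, though that is untested here.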
