Find the duration of overlap between pairs of timestamps using data.table


Problem description


Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table.

Here's my current code:

library(data.table)
library(dplyr)  # provides %>% and mutate(), used below

DT <- fread(
  "stage,ID,date1,date2
  1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
  1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
  1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
  2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
  2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),
    by = .EACHI, allow.cartesian = TRUE] %>%
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))

Which returns:

  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200

But I'd like to return only overlaps between user dyads (i.e. no more than two overlapping users). There are several hacky ways I can think of to achieve this, such as:

library(dplyr)
library(tidyr)

result %>% 
  filter(nchar(V1)==3) %>% 
  tidyr::separate(V1, c("ID1", "ID2"))

Which returns:

  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.

Ideally, I'd like to know if there's a way to modify the original data.table code to return this directly.

Solution

At first glance (and neglecting performance considerations), this requires only a minor modification to OP's code:

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
             if (.N == 2L) paste(ID, collapse = "+"),  
             by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B

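A result row is created only for those groups (i.e., time ranges) in which exactly two users are active. For every other group the `if` without an `else` evaluates to NULL, and data.table emits no row for it.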
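The filtering relies on general data.table behaviour that can be checked in isolation. The following is a minimal sketch on made-up data (the table `toy` and its columns `grp` and `id` are hypothetical and not part of the question): a group whose j expression evaluates to NULL contributes no rows to the result.

```r
library(data.table)

# Hypothetical toy data: group "a" has two rows, "b" has three, "c" has one.
toy <- data.table(
  grp = c("a", "a", "b", "b", "b", "c"),
  id  = c("X", "Y", "X", "Y", "Z", "X")
)

# `if` without `else` evaluates to NULL when the condition is FALSE,
# so only groups with exactly two rows produce a result row.
pairs <- toy[, if (.N == 2L) paste(id, collapse = "+"), by = grp]
print(pairs)
```

Only group "a" should survive here, which mirrors how the three-user and single-user time ranges disappear from the join result above.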


The OP has requested that the two IDs be shown in separate columns, along with the duration of the overlap. In addition, I suggest sorting the IDs.

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
     },  
   by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs
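Note that `dur.in.sec` is a difftime column, whereas the question's `lengthinseconds` was plain numeric. If a numeric column is preferred, one option is to wrap the `difftime()` call in `as.numeric()`. The sketch below is self-contained and makes a few assumptions of its own: it builds the sample data with `data.table()` instead of `fread()`, fixes the time zone to UTC, and names the duration column `secs`. None of these choices come from the original answer.

```r
library(data.table)

# Sample data from the question, built directly (assumption: UTC time zone).
DT <- data.table(
  stage = c(1L, 1L, 1L, 2L, 2L),
  ID    = c("A", "B", "C", "B", "D"),
  date1 = as.POSIXct(c("2018-04-17 00:00:00", "2018-04-17 00:00:00",
                       "2018-04-17 00:15:00", "2018-04-17 00:30:00",
                       "2018-04-17 00:30:00"), tz = "UTC"),
  date2 = as.POSIXct(c("2018-04-17 01:00:00", "2018-04-17 00:20:00",
                       "2018-04-17 01:00:00", "2018-04-17 01:10:00",
                       "2018-04-17 00:50:00"), tz = "UTC")
)

# Split each stage into consecutive, non-overlapping time ranges.
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

# Keep only time ranges covered by exactly two IDs; return the duration
# as a plain number of seconds rather than a difftime object.
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2],
       secs = as.numeric(difftime(end, start, units = "secs")))
   },
   by = .EACHI, allow.cartesian = TRUE]
print(result)
```

A numeric column is easier to sum or average per dyad later on, at the cost of losing the unit label that difftime prints.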
