Find the duration of overlap between pairs of timestamps using data.table
Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table.
Here's my current code:
library(data.table)
library(dplyr)  # provides %>% and mutate(), which are used below
DT <- fread(
"stage,ID,date1,date2
1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
sep = ","
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
             paste(ID, collapse = "+"),
             by = .EACHI, allow.cartesian = TRUE] %>%
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))
Which returns:
stage date1 date2 V1 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A 900
2 1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C 300
3 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C 2400
4 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B 1200
5 2 2018-04-17 00:50:00 2018-04-17 01:10:00 B 1200
But I'd like to return only overlaps between user dyads (i.e. intervals in which exactly two users overlap). There are several hacky ways I can think of to achieve this, such as:
library(dplyr)
library(tidyr)
result %>%
  filter(nchar(V1) == 3) %>%
  tidyr::separate(V1, c("ID1", "ID2"))
Which returns:
stage date1 date2 ID1 ID2 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B A 900
2 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A C 2400
3 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D B 1200
But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.
Ideally, I'd like to know if there's a way to modify the original data.table code to return this directly.
At first glance (and neglecting performance considerations), this requires only a minor modification to OP's code:
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
             if (.N == 2L) paste(ID, collapse = "+"),
             by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B
A result row is created only for those groups, i.e., time ranges, in which exactly two users are active.
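To see why the `if (.N == 2L)` guard drops the other groups, here is a minimal sketch on made-up data (the `toy` table and its columns `g` and `id` are hypothetical, not part of the OP's data). When the expression in `j` returns `NULL` for a group, data.table emits no row for that group. The sketch uses a plain `by = g` rather than `by = .EACHI`, but the guard works the same way in the join above.

```r
library(data.table)

# Hypothetical toy data: group "a" has 2 rows, "b" has 1, "c" has 3.
toy <- data.table(
  g  = c("a", "a", "b", "c", "c", "c"),
  id = c("x", "y", "z", "u", "v", "w")
)

# An `if` without `else` returns NULL when the condition is FALSE,
# so only groups with exactly two rows contribute a result row.
res <- toy[, if (.N == 2L) paste(id, collapse = "+"), by = g]
res
```

Only group "a" survives, with the two ids pasted together.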
The OP has requested to show the two IDs in separate columns and to show the duration of the overlap. In addition, I suggest sorting the IDs.
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end),
             if (.N == 2L) {
               tmp <- sort(ID)
               .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
             },
             by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs
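Both variants rely on the `breaks` table from the question, so it may help to look at what that step produces. The sketch below rebuilds it for stage 1 only. The names `DT1` and `breaks1` and the explicit `tz = "UTC"` are my assumptions for a reproducible illustration; the OP's code parses the timestamps in the local time zone.

```r
library(data.table)

# Stage 1 rows from the question; tz = "UTC" is an assumption for reproducibility.
DT1 <- data.table(
  ID    = c("A", "B", "C"),
  date1 = as.POSIXct(c("2018-04-17 00:00:00", "2018-04-17 00:00:00",
                       "2018-04-17 00:15:00"), tz = "UTC"),
  date2 = as.POSIXct(c("2018-04-17 01:00:00", "2018-04-17 00:20:00",
                       "2018-04-17 01:00:00"), tz = "UTC")
)

# Every distinct start or end time is a point where the set of active users may change.
tmp <- unique(sort(c(DT1$date1, DT1$date2)))

# Consecutive change points delimit the elementary intervals.
breaks1 <- data.table(start = head(tmp, -1L), end = tail(tmp, -1L))

# Length of each elementary interval in seconds.
breaks1[, secs := as.numeric(difftime(end, start, units = "secs"))]
breaks1
```

For stage 1 this gives three intervals of 900, 300 and 2400 seconds. Within each such interval the set of active users is constant, which is why counting the matching rows per interval with `.N` is enough to detect dyads.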