How to join 2 data tables by time interval and summarize overlapping and non-overlapping time periods by factor variable
Question
I have 2 data tables, each lists periods of observation effort, and type of effort (A,B,C). I would like to know the duration of time for the overlapping and non-overlapping periods of effort.
I've tried to do this with data.table and foverlaps, but can't figure out how to include all the non-overlapping periods.
Here is my example data. I first created 2 data tables containing periods of effort. My dataset will include periods of time when a single observer is on effort.
library(data.table)
library(lubridate)
# times have been edited so not fixed to minute intervals - to make more realistic
set.seed(13)
EffortType = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 01:00:00'), by = "1 sec"), 100, replace=F)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort1 = data.table(EffortType, On, Off)
EffortType2 = sample(c("A","B","C"), 100, replace = TRUE)
On2 = sample(seq(as.POSIXct('2016/01/01 12:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 sec"), 100, replace=F)
Off2 = On2 + minutes(sample(1:60, 100, replace=T))
Effort2 = data.table(EffortType2, On2, Off2)
#prep for using foverlaps
setkey(Effort1, On, Off)
setkey(Effort2, On2, Off2)
Then I use foverlaps to find where the effort overlaps. I've set nomatch=NA, but this only gives me a right outer join. I would like a full outer join, so I wonder what the more appropriate function would be.
matches = foverlaps(Effort1,Effort2,type="any",nomatch=NA)
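One possible way to approximate a full outer join, shown as a hedged sketch on toy data: foverlaps keeps every row of its first argument, so the rows of the second table that never matched can be found with an anti-join and appended. The tables e1 and e2 and their values are made up for illustration and are not the random sample above.

```r
library(data.table)

# Toy data (hypothetical values, chosen so one interval in each table has no partner)
e1 <- data.table(
  EffortType = c("A", "B"),
  On  = as.POSIXct(c("2016-01-01 01:00:00", "2016-01-01 03:00:00"), tz = "UTC"),
  Off = as.POSIXct(c("2016-01-01 02:00:00", "2016-01-01 04:00:00"), tz = "UTC"))
e2 <- data.table(
  EffortType2 = c("C", "A"),
  On2  = as.POSIXct(c("2016-01-01 01:30:00", "2016-01-01 06:00:00"), tz = "UTC"),
  Off2 = as.POSIXct(c("2016-01-01 02:30:00", "2016-01-01 07:00:00"), tz = "UTC"))
setkey(e1, On, Off)
setkey(e2, On2, Off2)

# Every row of e1; the e2 columns are NA where nothing overlaps
m <- foverlaps(e1, e2, type = "any", nomatch = NA)

# Rows of e2 that never appeared in a match (anti-join on e2's interval columns)
unmatched2 <- e2[!m, on = c("On2", "Off2")]

# Stack both pieces; fill = TRUE pads the missing e1 columns with NA
full <- rbindlist(list(m, unmatched2), use.names = TRUE, fill = TRUE)
print(full)
```

In this sketch, rows with NA in the e2 columns are e1-only periods, and rows with NA in the e1 columns are e2-only periods.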
I've continued on here to show how I've tried to determine the duration of all the overlapping and non-overlapping shift times. But I don't think I've got this part correct either.
# find start and end of intersection of all shifts
matches$start = pmax(matches$On, matches$On2, na.rm=T)
matches$end = pmin(matches$Off, matches$Off2, na.rm=T)
# create intervals and find durations
matches$int = interval(matches$start, matches$end)
matches$dur = as.duration(matches$int)
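A small sketch of why na.rm=T can be misleading here: for an unmatched row, pmax and pmin with na.rm=T fall back to the row's own On and Off, so the "intersection" silently becomes the whole interval. Leaving the NA in place keeps overlap time and total time separate. The table m below is a hypothetical stand-in for the join result, with made-up values.

```r
library(data.table)

# Toy stand-in for the result of the overlap join (hypothetical values)
m <- data.table(
  EffortType  = c("A", "B"),
  EffortType2 = c("C", NA),
  On   = as.POSIXct(c("2016-01-01 01:00:00", "2016-01-01 03:00:00"), tz = "UTC"),
  Off  = as.POSIXct(c("2016-01-01 02:00:00", "2016-01-01 04:00:00"), tz = "UTC"),
  On2  = as.POSIXct(c("2016-01-01 01:30:00", NA), tz = "UTC"),
  Off2 = as.POSIXct(c("2016-01-01 02:30:00", NA), tz = "UTC"))

# Overlap in minutes: latest start to earliest end; NA when there is no partner
m[, overlap_min := as.numeric(difftime(pmin(Off, Off2), pmax(On, On2), units = "mins"))]
m[is.na(overlap_min), overlap_min := 0]

# Length of the first table's own interval, in minutes
m[, own_min := as.numeric(difftime(Off, On, units = "mins"))]
print(m[, .(EffortType, EffortType2, overlap_min, own_min)])
```

Note that an interval matching several partners appears in several rows, so summing own_min directly would count it more than once; this sketch does not handle that case.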
I would then like to sum up the amount of observation effort time for each grouping of "EffortType"
And end up with something like this (the numbers are examples only, because I have not managed to figure out how to calculate this correctly, even in Excel)
EffortType   Duration (minutes)
A            10
B            20
C            12
AA           8
BB           6
CC           1
AC           160
AB           200
BC           150
Recommended answer
Not the entire answer (see the last paragraph), but I think this will get you what you want.
library( data.table )
library( lubridate )
set.seed(13)
EffortType = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 01:00:00'), by = "15 mins"), 100, replace=T)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort1 = data.table(EffortType, On, Off)
EffortType2 = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 12:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "15 mins"), 100, replace=T)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort2 = data.table(EffortType2, On, Off)
#create DT of minutes, spanning your entire period.
dt.minutes <- data.table( On = seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 mins"),
Off = seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 mins") + 60 )
#prep for using foverlaps
setkey(Effort1, On, Off)
setkey(Effort2, On, Off)
#overlap join both efforts on the dt.minutes. note the use of "within" and "nomatch" to throw away minutes without events.
m1 <- foverlaps(dt.minutes, Effort1 ,type="within",nomatch=0L)
m2 <- foverlaps(dt.minutes, Effort2 ,type="within",nomatch=0L)
#bind together
result <- rbindlist(list(m1,m2))[, `:=`(On=i.On, Off = i.Off)][, `:=`(i.On = NULL, i.Off = NULL)]
#cast the result
result.cast <- dcast( result, On + Off ~ EffortType, value.var = "EffortType")
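To make the minute-grid idea easier to check by hand, here is a tiny sketch with one effort period and a short grid. The names effort, grid and hits are hypothetical. Because type="within" only keeps grid minutes that lie entirely inside an effort period, partial minutes at either end of a period are dropped, which matters if the real On/Off times are not aligned to whole minutes.

```r
library(data.table)

# One effort period of three minutes (hypothetical values)
effort <- data.table(
  EffortType = "A",
  On  = as.POSIXct("2016-01-01 01:00:00", tz = "UTC"),
  Off = as.POSIXct("2016-01-01 01:03:00", tz = "UTC"))
setkey(effort, On, Off)

# Grid of one-minute intervals starting two minutes before the effort period
grid_start <- seq(as.POSIXct("2016-01-01 00:58:00", tz = "UTC"),
                  by = "1 min", length.out = 8)
grid <- data.table(On = grid_start, Off = grid_start + 60)

# Keep only the grid minutes that fall completely within an effort period
hits <- foverlaps(grid, effort, type = "within", nomatch = 0L)
print(hits)
```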
Result
head( result.cast, 10)
# On Off A B C
# 1: 2016-01-01 01:00:00 2016-01-01 01:01:00 1 0 1
# 2: 2016-01-01 01:01:00 2016-01-01 01:02:00 1 0 1
# 3: 2016-01-01 01:02:00 2016-01-01 01:03:00 1 0 1
# 4: 2016-01-01 01:03:00 2016-01-01 01:04:00 1 0 1
# 5: 2016-01-01 01:04:00 2016-01-01 01:05:00 1 0 1
# 6: 2016-01-01 01:05:00 2016-01-01 01:06:00 1 0 1
# 7: 2016-01-01 01:06:00 2016-01-01 01:07:00 1 0 1
# 8: 2016-01-01 01:07:00 2016-01-01 01:08:00 1 0 1
# 9: 2016-01-01 01:08:00 2016-01-01 01:09:00 1 0 1
# 10: 2016-01-01 01:09:00 2016-01-01 01:10:00 1 0 1
Sometimes an event occurs 2-3 times within the same minute, like
# On Off A B C
#53: 2016-01-02 14:36:00 2016-01-02 14:37:00 2 2 3
Not sure on how you want to sum that...
If you can treat them as a single minute, then:
> result.cast[A>0 & B==0 & C==0, .N]   # A only
[1] 476
> result.cast[A==0 & B>0 & C==0, .N]   # B only
[1] 386
> result.cast[A==0 & B==0 & C>0, .N]   # C only
[1] 504
> result.cast[A>0 & B>0 & C==0, .N]    # A and B
[1] 371
> result.cast[A==0 & B>0 & C>0, .N]    # B and C
[1] 341
> result.cast[A>0 & B==0 & C>0, .N]    # A and C
[1] 472
> result.cast[A>0 & B>0 & C>0, .N]     # A, B and C
[1] 265
That should do the trick to get the duration in minutes, I think (although this can probably be done in a much smarter way).
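One possibly tidier alternative, sketched under the same "treat each minute once" assumption: label every minute with the set of effort types present and count the rows per label in one grouped step. The table rc below is a hypothetical stand-in for result.cast with made-up counts.

```r
library(data.table)

# Toy stand-in for result.cast: one row per minute, counts per effort type
rc <- data.table(A = c(1, 0, 2, 0, 1),
                 B = c(0, 1, 1, 0, 0),
                 C = c(0, 0, 0, 3, 0))

# Build a label such as "A", "AB" or "ABC" from the types present in each minute
rc[, combo := paste0(ifelse(A > 0, "A", ""),
                     ifelse(B > 0, "B", ""),
                     ifelse(C > 0, "C", ""))]

# One row per combination; each row of rc stands for one minute
summary_dt <- rc[, .(minutes = .N), by = combo][order(combo)]
print(summary_dt)
```

This does not distinguish which of the two source tables a type came from, so it cannot produce labels such as "AA" on its own.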