基于另一个表中的日期范围在一个表中创建虚拟变量 [英] Create dummy variables in one table based on range of dates in another table
问题描述
我有两个表。 table1
看起来像这样
I have two tables. table1
looks like this
date hour data
2010-05-01 3 5
2010-05-02 7 7
2010-05-02 10 8
2010-07-03 18 3
2011-12-09 22 1
2012-05-01 3 0
这是存储为
。 date
和 hour
上设置键的数据表
我有另一个表,看起来像这样。这是我的停机
表。
This is stored as a data.table
with key set on date
and hour
.
I have another table, that looks like this. It's my outages
table.
resource date_out date_back
joey 2010-04-30 4:00:00 2010-05-02 8:30:00
billy 2009-04-20 7:00:00 2009-02-02 5:30:00
bob 2011-11-15 12:20:00 2010-12-09 23:00:00
joey 2012-04-28 1:00:00 2012-05-02 17:00:00
我想添加列到 table1
从中断
表。我想要这些列中的值为0,当有没有中断时,为1。
I want to add columns to table1
where those columns are the resource from the outages
table. I want the values in those columns to be 0 for whenever there isn't an outage and 1 for when there is.
此示例的结果应为
The result for this example should be.
date hour data joey billy bob
2010-05-01 3 5 1 0 0
2010-05-02 7 7 1 0 0
2010-05-02 10 8 0 0 0
2010-07-03 18 3 0 0 0
2011-12-09 22 1 0 0 1
2012-05-01 3 0 1 0 0
实际上,我的 table1
有大约2500行,我的中断
表有19000.我可以想到这样做的唯一方法是循环遍历中断
表,然后在正确的位置插入1到 table1
。我的代码依赖 table1
是为了所以至少它不必扫描100%的表中的每一行中断
。但下面的资料需要4小时以上。
In actuality my table1
has about 2500 rows and my outages
table has 19000. The only way I could think to do this is to loop through each row of the outages
table and then insert 1s into table1
in the correct places. My code relies on table1
being in order so at least it doesn't have to scan 100% of that table for every row of outages
. However the below takes over 4 hours for my data.
for (out in 1:length(outages$resource)) {
a<-as.character(outages[out]$resource)
#if column doesn't exist then create it
if (a %in% colnames(table1)==FALSE) {
table1$new<-0
setnames(table1, "new", a)
}
midpoint<-round(length(table1$date)/2,0)
if (table1$date[midpoint]+table1$hour[midpoint]*60*60>=outages[out]$due_out && table1$date[midpoint]+table1$hour[midpoint]*60*60<=outages [out]$due_back)
{
while(table1$date[midpoint]+table1$hour[midpoint]*60*60>=outages[out]$due_out && midpoint>=1 && midpoint<=length(table1$date)) {
table1[midpoint,a:=1,with=FALSE]
midpoint<-midpoint-1
}
midpoint<-round(length(table1$date)/2,0)
while(table1$date[midpoint]+table1$hour[midpoint]*60*60<=outages[out]$due_back && midpoint>=1 && midpoint<=length(table1$date)) {
table1[midpoint,a:=1,with=FALSE]
midpoint<-midpoint+1
}
} else {
if (table1$date[midpoint]+table1$hour[midpoint]*60*60>outages[out]$due_back) {
while(table1$date[midpoint]+table1$hour[midpoint]*60*60>outages[out]$due_back && midpoint>=1 && midpoint<=length(table1$date)) {
midpoint<-midpoint-1
}
while(table1$date[midpoint]+table1$hour[midpoint]*60*60>=outages[out]$due_out && midpoint>=1 && midpoint<=length(table1$date)) {
table1[midpoint,a:=1,with=FALSE]
midpoint<-midpoint-1
}
}
midpoint<-round(length(table1$date)/2,0)
if (table1$date[midpoint]+table1$hour[midpoint]*60*60<outages[out]$due_out) {
while(table1$date[midpoint]+table1$hour[midpoint]*60*60<outages[out]$due_out && midpoint>=1 && midpoint<=length(table1$date)) {
midpoint<-midpoint+1
}
while(table1$date[midpoint]+table1$hour[midpoint]*60*60<=outages[out]$due_back && midpoint>=1 && midpoint<=length(table1$date)) {
table1[midpoint,a:=1,with=FALSE]
midpoint<-midpoint+1
}
}
}
if (sum(table1[,a,with=FALSE])==0) {
table1[,a:=NULL,with=FALSE]
}
}
要引用每个人最喜欢的电视节目必须有更好的方法。
To quote everybody's favorite infomercial line "There's got to be a better way".
推荐答案
这里是实现你想要的方法。这假设您的 table1
的时间精度为1小时。虽然它可以被修改为任意精度,它将执行更好的更大的时间间隔,因为它构建的可能时间的完整序列在 date_out
- date_back
range。注意,我使用与OP略有不同的表来说明重叠间隔,并纠正OP中的一些错误。
Here's a way of achieving what you want. This assumes your table1
's time precision is 1 hour. Though it can be modified to an arbitrary precision, it will perform much better for larger time intervals as it constructs the full sequence of possible times in the date_out
-date_back
range. Note, I used slightly different tables from OP to illustrate overlapping intervals and to correct some mistakes in OP.
table1 = data.table(date = c("2010-05-01", "2010-05-02", "2010-05-02", "2010-07-03", "2011-12-09", "2012-05-01"), hour = c(3,7,10,18,22,3), data = c(5,7,8,3,1,0))
outages = data.table(resource = c("joey", "bob", "billy", "bob", "joey"), date_out = c("2010-04-30 4:00:00", "2010-04-30 4:00:00", "2009-04-20 7:00:00", "2011-11-15 12:20:00", "2012-04-28 1:00:00"), date_back=c("2010-05-02 8:30:00", "2010-05-02 8:30:00", "2009-06-02 5:30:00", "2011-12-09 23:00:00", "2012-05-02 17:00:00"))
# round up date_out and round down date_back
# and create a sequence in-between spaced by 1 hour
outages[, list(datetime = seq(as.POSIXct(round(as.POSIXct(date_out) + 30*60-1, "hours")),
as.POSIXct(round(as.POSIXct(date_back) - 30*60, "hours")),
60*60)),
by = list(resource, date_out)] -> outages.expanded
setkey(outages.expanded, datetime)
# merge with the original table, then run "table" to get the frequencies/occurences
# and cbind back with the original table
cbind(table1, unclass(table(
outages.expanded[table1[, list(datetime=as.POSIXct(paste0(date, " ", hour, ":00:00")))],
resource])))
# date hour data bob joey
#1: 2010-05-01 3 5 1 1
#2: 2010-05-02 7 7 1 1
#3: 2010-05-02 10 8 0 0
#4: 2010-07-03 18 3 0 0
#5: 2011-12-09 22 1 1 0
#6: 2012-05-01 3 0 0 1
这篇关于基于另一个表中的日期范围在一个表中创建虚拟变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!