Add dummies with conditions in data.table?
Sorry for the long question. I will try my best to state my goal clearly.
I want to add dummy variables to a data.table using update by reference (:=), as in the question already answered in this link, but my case is a little more complicated.
To illustrate, I created some sample data.
DT <- data.table(UID = paste0("UID",rep(1:5,each=2)),
date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
value = c(1:10))
DT is a data.table containing UID, date, and value. The original data has the same structure, but covers a longer time span (2 years).
Here I want to add dummies based on the date.
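The dates contain several special time spans; vacations serve as the example here.
In the sample data created above, there are two vacations:
- From "2012-01-02" to "2012-01-05"
- From "2012-02-02" to "2012-02-03"
I want to add two types of dummies:
- Dummies about the length of the vacation: first calculate the length of each vacation. In this example, there are two different lengths (2 and 4), so we add 2 dummies indicating whether the date falls in a vacation of that length.
- Dummies about whether the day is exactly one day before a vacation, or exactly one day after a vacation.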
The expected result is like this:
UID   Date      Val  D_length_2  D_length_4
UID1  1/1/2012  1    FALSE       FALSE
UID2  1/2/2012  2    FALSE       TRUE
UID3  1/3/2012  3    FALSE       TRUE
UID4  1/4/2012  4    FALSE       TRUE
UID5  1/5/2012  5    FALSE       TRUE
UID1  1/6/2012  6    FALSE       FALSE
UID2  2/1/2012  7    FALSE       FALSE
UID3  2/2/2012  8    TRUE        FALSE
UID4  2/3/2012  9    TRUE        FALSE
UID5  2/4/2012  10   FALSE       FALSE
UID   Date      Val  Before  After
UID1  1/1/2012  1    TRUE    FALSE
UID2  1/2/2012  2    FALSE   FALSE
UID3  1/3/2012  3    FALSE   FALSE
UID4  1/4/2012  4    FALSE   FALSE
UID5  1/5/2012  5    FALSE   FALSE
UID1  1/6/2012  6    FALSE   TRUE
UID2  2/1/2012  7    TRUE    FALSE
UID3  2/2/2012  8    FALSE   FALSE
UID4  2/3/2012  9    FALSE   FALSE
UID5  2/4/2012  10   FALSE   TRUE
So the complete desired result is:
UID   Date      Val  Before  After  D_length_2  D_length_4
UID1  1/1/2012  1    TRUE    FALSE  FALSE       FALSE
UID2  1/2/2012  2    FALSE   FALSE  FALSE       TRUE
UID3  1/3/2012  3    FALSE   FALSE  FALSE       TRUE
UID4  1/4/2012  4    FALSE   FALSE  FALSE       TRUE
UID5  1/5/2012  5    FALSE   FALSE  FALSE       TRUE
UID1  1/6/2012  6    FALSE   TRUE   FALSE       FALSE
UID2  2/1/2012  7    TRUE    FALSE  FALSE       FALSE
UID3  2/2/2012  8    FALSE   FALSE  TRUE        FALSE
UID4  2/3/2012  9    FALSE   FALSE  TRUE        FALSE
UID5  2/4/2012  10   FALSE   TRUE   FALSE       FALSE
The full data set has more than 10 million rows, with about 10 different vacations of 4 different lengths.
For the second type of dummy, I tried:
f <- function(x){
ifelse(x %in% as.Date(c("2012-01-02","2012-02-02")) - 1, return(TRUE), return(FALSE))
}
DT[,Before:= f(date)]
But it does not seem to be correct.
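Two things probably explain why the attempt above misbehaves. In R, %in% binds more tightly than binary minus, so the - 1 is applied to the logical result of %in% rather than to the dates. In addition, calling return() inside ifelse() exits f with a single value instead of building one value per row. The following is a minimal sketch of a vectorized alternative, not a definitive fix; the shortened sample data and the name vac_start are assumptions made for illustration.

```r
library(data.table)

# Shortened sample data (illustrative only)
DT <- data.table(
  date = as.IDate(c("2012-01-01", "2012-01-02", "2012-02-01", "2012-02-02"))
)

# Hypothetical vector of vacation start dates
vac_start <- as.IDate(c("2012-01-02", "2012-02-02"))

# Parenthesize the subtraction so it shifts the dates, not the result of %in%;
# %in% is already vectorized, so no ifelse() or return() is needed
DT[, Before := date %in% (vac_start - 1L)]

# Before is TRUE for 2012-01-01 and 2012-02-01 only
print(DT)
```

The same pattern with the vacation end dates plus one day would give the After dummy.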
For the first type, I have not come up with a good solution.
This problem is about updating by reference in data.table; any thoughts on how to approach it and how to write the update functions are very welcome!
Here is a start:
library(data.table)
DT <- data.table(UID = paste0("UID",rep(1:5,each=2)),
date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
value = c(1:10))
setkey(DT, date)
vacStart <- data.table(start = as.IDate(c("2012-01-02", "2012-02-02")), key="start")
vacEnd <- data.table(date = as.IDate(c("2012-01-05", "2012-02-03")), key="date")
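#identify vacations:
#roll=TRUE matches each date to the last vacation start on or before it
vacStart[, Start:=.I]
DT <- vacStart[DT, roll=TRUE]
#roll=-Inf matches each date to the next vacation end on or after it
vacEnd[, End:=.I]
DT <- vacEnd[DT, roll=-Inf]
#a date lies inside vacation k only if both indices equal k; otherwise vac is 0
DT[,vac:=(End==Start)*Start]
DT[is.na(vac), vac:=0L]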
#2-day vacations:
DT[,length_2 := (.N==2) & vac!=0, by=vac]
#days before vacation
DT[,before := c(diff(vac)>0, FALSE) & vac==0]
#           date End Start  UID value vac length_2 before
#  1: 2012-01-01   1    NA UID1     1   0    FALSE   TRUE
#  2: 2012-01-02   1     1 UID1     2   1    FALSE  FALSE
#  3: 2012-01-03   1     1 UID2     3   1    FALSE  FALSE
#  4: 2012-01-04   1     1 UID2     4   1    FALSE  FALSE
#  5: 2012-01-05   1     1 UID3     5   1    FALSE  FALSE
#  6: 2012-01-06   2     1 UID3     6   0    FALSE  FALSE
#  7: 2012-02-01   2     1 UID4     7   0    FALSE   TRUE
#  8: 2012-02-02   2     2 UID4     8   2     TRUE  FALSE
#  9: 2012-02-03   2     2 UID5     9   2     TRUE  FALSE
# 10: 2012-02-04  NA     2 UID5    10   0    FALSE  FALSE
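The start above covers the 2-day dummy and the before dummy. As one possible way to fill in the rest, here is a sketch that builds all four requested columns from a small vacation lookup table. The table vac, its column len, and the loop are assumptions for illustration and not part of the answer above. Computing each length from the start and end dates, rather than from .N, may matter if the real data has several rows per date.

```r
library(data.table)

DT <- data.table(
  UID   = paste0("UID", rep(1:5, each = 2)),
  date  = as.IDate(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
                     "2012-01-05", "2012-01-06", "2012-02-01", "2012-02-02",
                     "2012-02-03", "2012-02-04")),
  value = 1:10
)

# Hypothetical lookup table: one row per vacation, length in calendar days
vac <- data.table(
  start = as.IDate(c("2012-01-02", "2012-02-02")),
  end   = as.IDate(c("2012-01-05", "2012-02-03"))
)
vac[, len := as.integer(end - start) + 1L]

# Day before any vacation start, day after any vacation end
DT[, Before := date %in% (vac$start - 1L)]
DT[, After  := date %in% (vac$end + 1L)]

# One dummy per distinct length, initialised to FALSE
len_cols <- paste0("D_length_", sort(unique(vac$len)))
DT[, (len_cols) := FALSE]

# Update by reference, one vectorized pass per vacation
for (k in seq_len(nrow(vac))) {
  s <- vac$start[k]
  e <- vac$end[k]
  dummy_col <- paste0("D_length_", vac$len[k])
  DT[date >= s & date <= e, (dummy_col) := TRUE]
}

print(DT)
```

With roughly 10 vacations, the loop makes about 10 passes over the table, each a plain vectorized comparison, and every column is added with := so DT is not copied.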