Add dummies with conditions in data.table?


Problem description


    Sorry for the long question. I will try my best to state my goal clearly.

    I want to add dummy variables to a data.table using the update method (:=), much like an already-answered question, but a little more complicated.

    To illustrate, I created some sample data.

    DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                     date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                     value = c(1:10)) 
    

    DT is a data.table containing UID, date, and value. The real data has the same structure but covers a longer time span (2 years).

    Here I want to add dummies based on the date.

    There are several special time spans in the dates; we can use vacations to represent them.

    For example, in the sample data above there are two vacations:

    1. From "2012-01-02" to "2012-01-05"
    2. From "2012-02-02" to "2012-02-03"
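
One way to make these spans usable in later steps is to store them in a small lookup table. The following is a minimal sketch; the names `vacations`, `start`, `end`, and `len` are illustrative and do not come from the question.

```r
library(data.table)

# Hypothetical lookup table: one row per vacation (names are illustrative)
vacations <- data.table(
  start = as.IDate(c("2012-01-02", "2012-02-02")),
  end   = as.IDate(c("2012-01-05", "2012-02-03"))
)

# Inclusive length in days: both the first and the last day count
vacations[, len := as.integer(end - start) + 1L]
```

With the two vacations above, `len` comes out as 4 and 2, which are the two lengths the length dummies refer to.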

    I want to add 2 types of dummies

    1. Dummies for the length of the vacation: first calculate the length of each vacation. In this example there are two different lengths (2 and 4), so we add 2 dummies indicating whether a date falls in a vacation of that length.

    The expected result looks like this:

    UID     Date    Val D_length_2  D_length_4
    UID1    1/1/2012    1   FALSE   FALSE
    UID2    1/2/2012    2   FALSE   TRUE
    UID3    1/3/2012    3   FALSE   TRUE
    UID4    1/4/2012    4   FALSE   TRUE
    UID5    1/5/2012    5   FALSE   TRUE
    UID1    1/6/2012    6   FALSE   FALSE
    UID2    2/1/2012    7   FALSE   FALSE
    UID3    2/2/2012    8   TRUE    FALSE
    UID4    2/3/2012    9   TRUE    FALSE
    UID5    2/4/2012    10  FALSE   FALSE
    

    2. Dummies for whether the day is exactly one day before a vacation, or exactly one day after a vacation.

    UID    Date      Val    Before  After
    UID1    1/1/2012    1   TRUE    FALSE
    UID2    1/2/2012    2   FALSE   FALSE
    UID3    1/3/2012    3   FALSE   FALSE
    UID4    1/4/2012    4   FALSE   FALSE
    UID5    1/5/2012    5   FALSE   FALSE
    UID1    1/6/2012    6   FALSE   TRUE
    UID2    2/1/2012    7   TRUE    FALSE
    UID3    2/2/2012    8   FALSE   FALSE
    UID4    2/3/2012    9   FALSE   FALSE
    UID5    2/4/2012    10  FALSE   TRUE
    

    So the complete desired result looks like this:

    UID Date    Val Before  After   D_length_2  D_length_4
    UID1    1/1/2012    1   TRUE    FALSE   FALSE   FALSE
    UID2    1/2/2012    2   FALSE   FALSE   FALSE   TRUE
    UID3    1/3/2012    3   FALSE   FALSE   FALSE   TRUE
    UID4    1/4/2012    4   FALSE   FALSE   FALSE   TRUE
    UID5    1/5/2012    5   FALSE   FALSE   FALSE   TRUE
    UID1    1/6/2012    6   FALSE   TRUE    FALSE   FALSE
    UID2    2/1/2012    7   TRUE    FALSE   FALSE   FALSE
    UID3    2/2/2012    8   FALSE   FALSE   TRUE    FALSE
    UID4    2/3/2012    9   FALSE   FALSE   TRUE    FALSE
    UID5    2/4/2012    10  FALSE   TRUE    FALSE   FALSE
    

    The full data has more than 10M rows, with about 10 different vacations and 4 different lengths.

    For the second type of dummies, I tried:

    f <- function(x){ 
    ifelse(x %in% as.Date(c("2012-01-02","2012-02-02")) - 1, return(TRUE), return(FALSE))
    }
    
    DT[,Before:= f(date)] 
    

    But it does not seem to be correct.
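
A likely reason the attempt fails, sketched here under the assumption that the goal is one logical value per row: `%in%` binds more tightly than binary `-`, so the test is evaluated as `(x %in% starts) - 1`, and calling `return()` inside `ifelse()` exits the function with a single value instead of a vector. The names `vac_start`, `vac_end`, and `dates` below are illustrative.

```r
library(data.table)

vac_start <- as.IDate(c("2012-01-02", "2012-02-02"))
vac_end   <- as.IDate(c("2012-01-05", "2012-02-03"))
dates <- as.IDate(c("2012-01-01", "2012-01-02", "2012-01-06",
                    "2012-02-01", "2012-02-04"))

# The function from the question, unchanged
f <- function(x) {
  ifelse(x %in% as.Date(c("2012-01-02", "2012-02-02")) - 1, return(TRUE), return(FALSE))
}
res_f <- f(dates)  # a single value, not one value per date

# Without parentheses this is (dates %in% vac_start) - 1, a numeric vector
wrong_test <- dates %in% vac_start - 1

# Parenthesise the date arithmetic; %in% is already vectorised,
# so neither ifelse() nor return() is needed
before <- dates %in% (vac_start - 1L)
after  <- dates %in% (vac_end + 1L)
```

Inside a data.table the same expressions can be assigned by reference, for example `DT[, Before := date %in% (vac_start - 1L)]`.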

    For the first type, I have not come up with a good solution.

    This problem is about updating in data.table; any thoughts on how to deal with it and how to write the update functions are very welcome!

    Solution

    Here is a start:

    library(data.table)
    
    DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                     date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                     value = c(1:10)) 
    setkey(DT, date)
    
    
    vacStart <- data.table(start = as.IDate(c("2012-01-02", "2012-02-02")), key="start")
    vacEnd <- data.table(date = as.IDate(c("2012-01-05", "2012-02-03")), key="date")
    
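    # Identify vacations: number each start and end, then use rolling joins
    vacStart[, Start := .I]
    # roll = TRUE: carry the most recent vacation start forward to each date
    DT <- vacStart[DT, roll = TRUE]
    vacEnd[, End := .I]
    # roll = -Inf: carry the next vacation end backward to each date
    DT <- vacEnd[DT, roll = -Inf]
    # A date lies inside vacation k when its last start and next end are both k
    DT[, vac := (End == Start) * Start]
    DT[is.na(vac), vac := 0L]

    # 2-day vacations:
    DT[, length_2 := (.N == 2) & vac != 0, by = vac]
    # days before vacation
    DT[, before := c(diff(vac) > 0, FALSE) & vac == 0]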
    #           date End Start  UID value vac length_2 before
    #  1: 2012-01-01   1    NA UID1     1   0    FALSE   TRUE
    #  2: 2012-01-02   1     1 UID1     2   1    FALSE  FALSE
    #  3: 2012-01-03   1     1 UID2     3   1    FALSE  FALSE
    #  4: 2012-01-04   1     1 UID2     4   1    FALSE  FALSE
    #  5: 2012-01-05   1     1 UID3     5   1    FALSE  FALSE
    #  6: 2012-01-06   2     1 UID3     6   0    FALSE  FALSE
    #  7: 2012-02-01   2     1 UID4     7   0    FALSE   TRUE
    #  8: 2012-02-02   2     2 UID4     8   2     TRUE  FALSE
    #  9: 2012-02-03   2     2 UID5     9   2     TRUE  FALSE
    # 10: 2012-02-04  NA     2 UID5    10   0    FALSE  FALSE
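
The answer above covers `length_2` and `before` and relies on `.N` and `diff()`, which assumes one row per date in date order. As an alternative sketch (not the answerer's method), a non-equi update join can tag every row that falls inside a vacation, which also works when several UIDs share a date. The names `vacations`, `len`, and `vac_len` are illustrative.

```r
library(data.table)

DT <- data.table(
  UID   = paste0("UID", rep(1:5, each = 2)),
  date  = as.IDate(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
                     "2012-01-05", "2012-01-06", "2012-02-01", "2012-02-02",
                     "2012-02-03", "2012-02-04")),
  value = 1:10
)

# Hypothetical lookup table of vacations with inclusive length in days
vacations <- data.table(
  start = as.IDate(c("2012-01-02", "2012-02-02")),
  end   = as.IDate(c("2012-01-05", "2012-02-03"))
)
vacations[, len := as.integer(end - start) + 1L]

# Non-equi update join: rows inside a vacation get that vacation's length,
# all other rows keep NA
DT[vacations, on = .(date >= start, date <= end), vac_len := i.len]

# One dummy column per distinct length; NA %in% k is FALSE
for (k in sort(unique(vacations$len))) {
  col <- paste0("D_length_", k)
  DT[, (col) := vac_len %in% k]
}

# Day immediately before a start, day immediately after an end
DT[, Before := date %in% (vacations$start - 1L)]
DT[, After  := date %in% (vacations$end + 1L)]

# Drop the helper column
DT[, vac_len := NULL]
```

If the rolling-join approach is preferred instead, the remaining dummies follow the same pattern as the answer's code, for example `c(FALSE, diff(vac) < 0) & vac == 0` for the day after a vacation and `(.N == 4) & vac != 0` grouped by `vac` for 4-day vacations, again assuming one row per date.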
    
