Selecting correct join with data.table


Question

This is a follow-up to an earlier question.

I have three data tables (the actual input one is way bigger and performance matters, so I have to use data.table as much as I can):

input <- fread("  ID   | T1 | T2 | T3 |    DATE    
                ACC001 |  1 |  0 |  0 | 31/12/2016 
                ACC001 |  1 |  0 |  1 | 30/06/2017 
                ACC002 |  0 |  1 |  1 | 31/12/2016", sep = "|")

mevs <- fread("  DATE    | INDEX_NAME | INDEX_VALUE 
              31/12/2016 | GDP        |  1.05       
              30/06/2017 | GDP        |  1.06       
              31/12/2017 | GDP        |  1.07       
              30/06/2018 | GDP        |  1.08       
              31/12/2016 | CPI        |  0.02       
              30/06/2017 | CPI        |  0.00       
              31/12/2017 | CPI        | -0.01       
              30/06/2018 | CPI        |  0.01   ", sep = "|")

time <- fread("    DATE   
               31/12/2017 
               30/06/2018 ", sep = "|")

With those, I need to achieve 2 things:

  • Insert GDP and CPI values from the second dt(mevs) into the first one (input), to make some calculations in the last column based on T1, T2, T3, GDP and CPI.

  • Make a projection for the time intervals given in the third dt (time): for each ID, copy the T1, T2 and T3 values from the previous interval if one exists (so the ACC001 rows would remain 1, 0, 1), fill them with 0 if it doesn't, and take GDP and CPI from the corresponding dates.

For that, I'm using the following pieces of code:

ones <- input[, .N, by = ID][N == 1, ID]

input[, .SD[time, on = "DATE"], by = ID
      ][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
        ][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
          , by = ID, .SDcols = 2:4][]

Which does (thanks to @Jaap):

  • input[, .SD[time, on = "DATE"], by = ID] joins for each ID the time data.table to the remaining columns, thus extending the data.table.

  • A wide version of mevs (dcast(mevs, DATE ~ INDEX_NAME)) is then joined to the extended data.table.

  • Finally, the missing values in the extended data.table are filled with the na.locf function from the zoo package.
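
To make the second step concrete, here is a minimal sketch of the reshape followed by an update join. The tables below are small made-up stand-ins (the names dt and wide are only for illustration), not the full data from the question:

```r
library(data.table)

# Small stand-in for mevs (long format: one row per date and index)
mevs <- data.table(
  DATE        = rep(c("31/12/2016", "30/06/2017"), times = 2),
  INDEX_NAME  = rep(c("GDP", "CPI"), each = 2),
  INDEX_VALUE = c(1.05, 1.06, 0.02, 0.00)
)

# Long -> wide: one row per DATE, one column per INDEX_NAME
wide <- dcast(mevs, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE")

# Hypothetical target table with the same DATE key
dt <- data.table(ID = "ACC001", DATE = c("31/12/2016", "30/06/2017"))

# Update join: for every row of dt that matches a row of wide on DATE,
# add GDP and CPI by reference; the i. prefix refers to columns of wide
dt[wide, on = "DATE", `:=`(GDP = i.GDP, CPI = i.CPI)]
print(dt)
```

Because `:=` modifies dt in place, the number of rows of dt never changes in this step; only the new columns are added.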

The expected output would be:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 31/12/2016 1.05  0.02
2: ACC001  1  0  1 30/06/2017 1.06  0.00
3: ACC001  1  0  1 31/12/2017 1.07 -0.01
4: ACC001  1  0  1 30/06/2018 1.08  0.01
5: ACC002  0  1  1 31/12/2016 1.05  0.02
6: ACC002  0  0  0 30/06/2017 1.06  0.00
7: ACC002  0  0  0 31/12/2017 1.07 -0.01
8: ACC002  0  0  0 30/06/2018 1.08  0.01

But what I get is:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001 NA NA NA 31/12/2017 1.07 -0.01
2: ACC001 NA NA NA 30/06/2018 1.08  0.01
3: ACC002 NA NA NA 31/12/2017 1.07 -0.01
4: ACC002 NA NA NA 30/06/2018 1.08  0.01

I'm almost sure that it has to be a wrong join choice between input and time in the first step, but I can't find a workaround for this.

Thanks, everyone, for your time.

Recommended answer
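
Some background on why the attempt in the question returns only the projection rows: in data.table, x[y, on = ...] returns one row for every row of y, so rows of x whose key does not appear in y are dropped. The sketch below uses small made-up tables (x, y and all_dates are illustrative names) to show the effect:

```r
library(data.table)

# Toy tables, not the question's data
x <- data.table(DATE = c("31/12/2016", "30/06/2017"), T1 = c(1L, 1L))
y <- data.table(DATE = c("31/12/2017", "30/06/2018"))

# One row per row of y: the historical dates in x have no match in y
# and disappear, while the unmatched y rows get NA in x's columns
res <- x[y, on = "DATE"]
print(res)

# Joining on the union of both date sets keeps the historical rows too
all_dates <- data.table(DATE = union(x$DATE, y$DATE))
res_all <- x[all_dates, on = "DATE"]
print(res_all)
```

This is why the join table has to contain the dates already present in input as well as the new projection dates.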

A possible solution:

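# Build the full set of dates: the projection dates plus the dates already in input
times <- unique(rbindlist(list(time, as.data.table(unique(input$DATE))))
                )[, DATE := as.Date(DATE, "%d/%m/%Y")][order(DATE)]
# Convert DATE to Date class in the other two tables so all joins use the same type
input[, DATE := as.Date(DATE, "%d/%m/%Y")]
mevs[, DATE := as.Date(DATE, "%d/%m/%Y")]

# IDs with a single observation: their gaps are filled with 0 instead of carried forward
ones <- input[, .N, by = ID][N == 1, ID]

# Expand every ID to all dates, add GDP/CPI from the wide mevs, then fill T1..T3 per ID
input[, .SD[times, on = "DATE"], by = ID
      ][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
        ][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
          , by = ID, .SDcols = 2:4][]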

which gives:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 2016-12-31 1.05  0.02
2: ACC001  1  0  1 2017-06-30 1.06  0.00
3: ACC001  1  0  1 2017-12-31 1.07 -0.01
4: ACC001  1  0  1 2018-06-30 1.08  0.01
5: ACC002  0  1  1 2016-12-31 1.05  0.02
6: ACC002  0  0  0 2017-06-30 1.06  0.00
7: ACC002  0  0  0 2017-12-31 1.07 -0.01
8: ACC002  0  0  0 2018-06-30 1.08  0.01
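
Since performance matters in the question, a variant without the per-ID .SD join and without zoo may be worth trying. The following is only a sketch, not part of the answer above. It assumes a data.table version that provides nafill(), and it rebuilds input by hand with DATE already converted to Date class. The names proj_dates, grid, filled and cols are illustrative:

```r
library(data.table)

# input from the question, with DATE already converted to Date class
input <- data.table(
  ID   = c("ACC001", "ACC001", "ACC002"),
  T1   = c(1L, 1L, 0L),
  T2   = c(0L, 0L, 1L),
  T3   = c(0L, 1L, 1L),
  DATE = as.Date(c("2016-12-31", "2017-06-30", "2016-12-31"))
)
proj_dates <- as.Date(c("2017-12-31", "2018-06-30"))

# Every ID crossed with every date (historical and projected)
grid <- CJ(ID = unique(input$ID),
           DATE = sort(unique(c(input$DATE, proj_dates))))

# One join instead of one join per ID; missing combinations become NA
filled <- input[grid, on = c("ID", "DATE")]

ones <- input[, .N, by = ID][N == 1, ID]
cols <- c("T1", "T2", "T3")

# IDs with a single observation: replace NA with 0
filled[ID %in% ones, (cols) := lapply(.SD, nafill, fill = 0L), .SDcols = cols]
# Remaining IDs: carry the last observation forward within each ID
filled[, (cols) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = cols]
print(filled)
```

The GDP and CPI columns can then be added with the same dcast update join as in the answer. Whether this is actually faster on the real data would need to be measured.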
