Selecting correct join with data.table
Question
I have three data tables (the actual input one is much bigger and performance matters, so I have to use data.table as much as I can):
input <- fread(" ID | T1 | T2 | T3 | DATE
ACC001 | 1 | 0 | 0 | 31/12/2016
ACC001 | 1 | 0 | 1 | 30/06/2017
ACC002 | 0 | 1 | 1 | 31/12/2016", sep = "|")
mevs <- fread(" DATE | INDEX_NAME | INDEX_VALUE
31/12/2016 | GDP | 1.05
30/06/2017 | GDP | 1.06
31/12/2017 | GDP | 1.07
30/06/2018 | GDP | 1.08
31/12/2016 | CPI | 0.02
30/06/2017 | CPI | 0.00
31/12/2017 | CPI | -0.01
30/06/2018 | CPI | 0.01 ", sep = "|")
time <- fread(" DATE
31/12/2017
30/06/2018 ", sep = "|")
With those, I need to achieve two things:
- Insert the GDP and CPI values from the second dt (mevs) into the first one (input), to make some calculations in the last column based on T1, T2, T3, GDP and CPI.
- Make a projection for the time intervals given in the third dt (time), copying the T1, T2 and T3 values from the previous interval for the same ID if it exists (so the ACC001 ones would remain 1, 0, 1), filling them with 0 if it doesn't, and getting GDP and CPI from the corresponding dates.
为此,我正在使用以下代码:
For that, I'm using the following pieces of code:
ones <- input[, .N, by = ID][N == 1, ID]
input[, .SD[time, on = "DATE"], by = ID
][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
, by = ID, .SDcols = 2:4][]
Which does the following (thanks to @Jaap):
- input[, .SD[time, on = "DATE"], by = ID] joins, for each ID, the time data.table to the remaining columns, thus extending the data.table.
- A wide version of mevs, dcast(mevs, DATE ~ INDEX_NAME), is then joined to the extended data.table.
- Finally, the missing values in the extended data.table are filled with the na.locf function from the zoo package.
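The reshape and join steps described above can be sketched on a toy table. The names long, wide and dt and all values are made up for illustration; value.var is passed explicitly here, which is an addition to the original call (dcast otherwise guesses the value column).

```r
library(data.table)

# Toy long-format index table (hypothetical values)
long <- data.table(
  DATE        = c("d1", "d2", "d1", "d2"),
  INDEX_NAME  = c("GDP", "GDP", "CPI", "CPI"),
  INDEX_VALUE = c(1.05, 1.06, 0.02, 0.00)
)

# One row per DATE, one column per index name
wide <- dcast(long, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE")

# Update join: add GDP and CPI to dt by reference, matching on DATE
dt <- data.table(ID = c("A", "A", "B"), DATE = c("d1", "d2", "d1"))
dt[wide, on = "DATE", `:=`(GDP = i.GDP, CPI = i.CPI)]
print(dt)
```

The pipeline in the question applies these same two steps to mevs and the extended input.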
The expected output of the full pipeline would be:
ID T1 T2 T3 DATE GDP CPI
1: ACC001 1 0 0 31/12/2016 1.05 0.02
2: ACC001 1 0 1 30/06/2017 1.06 0.00
3: ACC001 1 0 1 31/12/2017 1.07 -0.01
4: ACC001 1 0 1 30/06/2018 1.08 0.01
5: ACC002 0 1 1 31/12/2016 1.05 0.02
6: ACC002 0 0 0 30/06/2017 1.06 0.00
7: ACC002 0 0 0 31/12/2017 1.07 -0.01
8: ACC002 0 0 0 30/06/2018 1.08 0.01
But what I get is:
ID T1 T2 T3 DATE GDP CPI
1: ACC001 NA NA NA 31/12/2017 1.07 -0.01
2: ACC001 NA NA NA 30/06/2018 1.08 0.01
3: ACC002 NA NA NA 31/12/2017 1.07 -0.01
4: ACC002 NA NA NA 30/06/2018 1.08 0.01
I'm almost sure that it has to be a wrong join choice between input and time in the first step, but I can't find a workaround for this.
Thanks for your time, everyone.
Recommended answer
A possible solution:
# Combine the projection dates with the dates already present in input
times <- unique(rbindlist(list(time, data.table(DATE = unique(input$DATE))))
)[, DATE := as.Date(DATE, "%d/%m/%Y")][order(DATE)]
# Convert the DATE columns to Date class so they join and sort correctly
input[, DATE := as.Date(DATE, "%d/%m/%Y")]
mevs[, DATE := as.Date(DATE, "%d/%m/%Y")]
# IDs with a single observation: their missing values are filled with 0
ones <- input[, .N, by = ID][N == 1, ID]
input[, .SD[times, on = "DATE"], by = ID
][dcast(mevs, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE"), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
, by = ID, .SDcols = 2:4][]
which gives:
ID T1 T2 T3 DATE GDP CPI
1: ACC001 1 0 0 2016-12-31 1.05 0.02
2: ACC001 1 0 1 2017-06-30 1.06 0.00
3: ACC001 1 0 1 2017-12-31 1.07 -0.01
4: ACC001 1 0 1 2018-06-30 1.08 0.01
5: ACC002 0 1 1 2016-12-31 1.05 0.02
6: ACC002 0 0 0 2017-06-30 1.06 0.00
7: ACC002 0 0 0 2017-12-31 1.07 -0.01
8: ACC002 0 0 0 2018-06-30 1.08 0.01
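Why the original attempt returned only the projection dates: in data.table, X[Y, on = ...] returns a row for each row of Y, so .SD[time, on = "DATE"] keeps only the dates listed in time. None of those dates occur in input, hence the NA values in T1 to T3. The solution above works because times also contains the dates from input. A minimal sketch with toy tables (x, y and all_dates are hypothetical names, not from the question):

```r
library(data.table)

# Toy tables (hypothetical, not the question's data)
x <- data.table(DATE = c("a", "b"), T1 = c(1L, 0L))
y <- data.table(DATE = c("c", "d"))

# x[y]: the result has one row per row of y; x's own dates are dropped
only_y <- x[y, on = "DATE"]
print(only_y)

# Adding x's dates to the lookup table keeps them in the result
all_dates <- unique(rbind(x[, .(DATE)], y))
kept <- x[all_dates, on = "DATE"]
print(kept)
```

The unmatched dates still come back with NA in T1, which is why the fill step (0 for single-row IDs, na.locf otherwise) is needed afterwards.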