data.table replicate rows after join?


Question

I'm not sure I'm approaching this correctly, so first I'll show the problem I'm trying to solve and then the way I'm trying to solve it. Feel free to tell me how wrong I am, and suggest any better approach you can think of.

I have three data.tables (the actual "input" one is much bigger and performance matters, so I have to use data.table as much as I can):

input:

+--------+----+----+----+------------+
|   ID   | T1 | T2 | T3 |    DATE    | 
+--------+----+----+----+------------+
| ACC001 |  1 |  0 |  0 | 31/12/2016 |
| ACC001 |  1 |  0 |  1 | 30/06/2017 |
| ACC002 |  0 |  1 |  1 | 31/12/2016 |
| ACC002 |  0 |  1 |  1 | 30/06/2017 |
+--------+----+----+----+------------+

mevs:

+------------+------------+-------------+
|    DATE    | INDEX_NAME | INDEX_VALUE |
+------------+------------+-------------+
| 31/12/2016 | GDP        |  1.05       |
| 30/06/2017 | GDP        |  1.06       |
| 31/12/2017 | GDP        |  1.07       |
| 30/06/2018 | GDP        |  1.08       |
| 31/12/2016 | CPI        |  0.02       |
| 30/06/2017 | CPI        |  0.00       |
| 31/12/2017 | CPI        | -0.01       |
| 30/06/2018 | CPI        |  0.01       |
+------------+------------+-------------+   

time:

+------------+
|    DATE    |
+------------+
| 31/12/2016 |
| 30/06/2017 |
| 31/12/2017 |
| 30/06/2018 |
+------------+

With those, I need to achieve two things:

  • Insert the GDP and CPI values from the second dt (mevs) into the first one (input), to make some calculations in the last column based on T1, T2, T3, GDP and CPI.

  • Make a projection for the time intervals given in the third dt (time), copying the T1, T2 and T3 values from the previous interval of the same ID (so the ACC001 ones would remain (1, 0, 1)) and getting GDP and CPI from the corresponding dates. The final calculation would be done using the same function.

This should result in an "input" dt like this:

+--------+----+----+----+------------+------+-------+------+
| ID     | T1 | T2 | T3 | DATE       | GDP  | CPI   | CALC |
+--------+----+----+----+------------+------+-------+------+
| ACC001 | 1  | 0  | 0  | 31/12/2016 | 1.05 | 0.02  | fun  |
| ACC001 | 1  | 0  | 1  | 30/06/2017 | 1.06 | 0.00  | fun  |
| ACC001 | 1  | 0  | 1  | 31/12/2017 | 1.07 | -0.01 | fun  |
| ACC001 | 1  | 0  | 1  | 30/06/2018 | 1.08 | 0.01  | fun  |
| ACC002 | 0  | 1  | 1  | 31/12/2016 | 1.05 | 0.02  | fun  |
| ACC002 | 0  | 1  | 1  | 30/06/2017 | 1.06 | 0.00  | fun  |
| ACC002 | 0  | 1  | 1  | 31/12/2017 | 1.07 | -0.01 | fun  |
| ACC002 | 0  | 1  | 1  | 30/06/2018 | 1.08 | 0.01  | fun  |
+--------+----+----+----+------------+------+-------+------+

What I'm doing:

  • mevs <- mevs %>% tidyr::spread(INDEX_NAME, INDEX_VALUE) to get the index values into columns.
  • input[mevs, ':=' (GDP = i.GDP, CPI = i.CPI), on = "RUN_DATE"] to set the index values (avoiding assignments, if I'm not mistaken).
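The two steps above can be sketched end to end on the toy tables from the question. This is only a minimal illustration under some assumptions: it joins on DATE, as in the tables shown (the snippet above uses "RUN_DATE"), it swaps tidyr::spread for data.table's own dcast, and the name mevs_wide is made up for the example.

```r
library(data.table)

# Toy data copied from the tables in the question
input <- data.table(
  ID   = c("ACC001", "ACC001", "ACC002", "ACC002"),
  T1   = c(1L, 1L, 0L, 0L),
  T2   = c(0L, 0L, 1L, 1L),
  T3   = c(0L, 1L, 1L, 1L),
  DATE = c("31/12/2016", "30/06/2017", "31/12/2016", "30/06/2017")
)
mevs <- data.table(
  DATE        = rep(c("31/12/2016", "30/06/2017", "31/12/2017", "30/06/2018"), 2),
  INDEX_NAME  = rep(c("GDP", "CPI"), each = 4),
  INDEX_VALUE = c(1.05, 1.06, 1.07, 1.08, 0.02, 0.00, -0.01, 0.01)
)

# Long -> wide: one row per DATE, one column per index (GDP, CPI)
mevs_wide <- dcast(mevs, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE")

# Update join: adds GDP and CPI to `input` by reference, so no `<-` is needed
input[mevs_wide, on = "DATE", `:=`(GDP = i.GDP, CPI = i.CPI)]
```

Because `:=` modifies input in place, the number of rows stays the same; only the two new columns are added.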

Result:

+--------+----+----+----+------------+------+------+------+
| ID     | T1 | T2 | T3 | DATE       | GDP  | CPI  | CALC |
+--------+----+----+----+------------+------+------+------+
| ACC001 | 1  | 0  | 0  | 31/12/2016 | 1.05 | 0.02 | fun  |
| ACC001 | 1  | 0  | 1  | 30/06/2017 | 1.06 | 0    | fun  |
| ACC002 | 0  | 1  | 1  | 31/12/2016 | 1.05 | 0.02 | fun  |
| ACC002 | 0  | 1  | 1  | 30/06/2017 | 1.06 | 0    | fun  |
+--------+----+----+----+------------+------+------+------+

What I don't know how to do:

I'm trying to do a right outer join of "input" with "time" on "DATE" (before the "selective join" in the second step of "What I'm doing") with the following code: input <- input[time, on = "DATE"]. But not only does it not work properly (I get NAs in the ID column, which I need for the next step), it also forces me to make an assignment.

After that, I was planning to do another join of "input" with "input" on "ID", but obviously I can't, since I don't have any ID value in those new rows:

+--------+----+----+----+------------+
| ID     | T1 | T2 | T3 | DATE       |
+--------+----+----+----+------------+
| ACC001 | 1  | 0  | 0  | 31/12/2016 |
| ACC001 | 1  | 0  | 1  | 30/06/2017 |
| NA     | NA | NA | NA | 31/12/2017 |
| NA     | NA | NA | NA | 30/06/2018 |
| ACC002 | 0  | 1  | 1  | 31/12/2016 |
| ACC002 | 0  | 1  | 1  | 30/06/2017 |
| NA     | NA | NA | NA | 31/12/2017 |
| NA     | NA | NA | NA | 30/06/2018 |
+--------+----+----+----+------------+

Is there any way to, for example, replicate those IDs based on some condition on the DATE column? If not, do you know of any other solution, maybe based on rbindlist?

Thanks so much for making it this far. Any advice will be highly appreciated!

Avoiding assignment

@Jaap's solution returns the desired data.table, thanks for that. I'd need to turn input into that last data.table without using the standard assignment (<-) unless unavoidable. How can this be done in this case?

Condition

I need to introduce a particularity in the last part of the script. If an ID is missing some records before the projections, T1/T2/T3 have to be 0 in the projections. That would be the case of ACC002 here, which doesn't have records beyond 31/12/2016:

input <- fread("  ID   | T1 | T2 | T3 |    DATE    
                ACC001 |  1 |  0 |  0 | 31/12/2016 
                ACC001 |  1 |  0 |  1 | 30/06/2017 
                ACC002 |  0 |  1 |  1 | 31/12/2016", sep = "|")

That should end up as:

+--------+----+----+----+------------+------+-------+------+
| ID     | T1 | T2 | T3 | DATE       | GDP  | CPI   | CALC |
+--------+----+----+----+------------+------+-------+------+
| ACC001 | 1  | 0  | 0  | 31/12/2016 | 1.05 | 0.02  | fun  |
| ACC001 | 1  | 0  | 1  | 30/06/2017 | 1.06 | 0.00  | fun  |
| ACC001 | 1  | 0  | 1  | 31/12/2017 | 1.07 | -0.01 | fun  |
| ACC001 | 1  | 0  | 1  | 30/06/2018 | 1.08 | 0.01  | fun  |
| ACC002 | 0  | 1  | 1  | 31/12/2016 | 1.05 | 0.02  | fun  |
| ACC002 | 0  | 0  | 0  | 30/06/2017 | 1.06 | 0.00  | fun  |
| ACC002 | 0  | 0  | 0  | 31/12/2017 | 1.07 | -0.01 | fun  |
| ACC002 | 0  | 0  | 0  | 30/06/2018 | 1.08 | 0.01  | fun  |
+--------+----+----+----+------------+------+-------+------+

The actual purpose of this is for the CALC column, which relies on a T1/T2/T3-dependent polynomial, to be equal to 0 in that situation (in case you find it better to approach it directly from there).
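One possible reading of this condition, sketched under assumptions rather than taken from the original post: an ID counts as "missing records" when its latest DATE is earlier than the latest DATE found anywhere in input. The names obs_end, stale and expanded are made up for the example, the dates are parsed to IDate so they compare correctly, and data.table's CJ and nafill are used to build and fill the grid.

```r
library(data.table)

fmt <- "%d/%m/%Y"
input <- data.table(
  ID   = c("ACC001", "ACC001", "ACC002"),
  T1   = c(1L, 1L, 0L),
  T2   = c(0L, 0L, 1L),
  T3   = c(0L, 1L, 1L),
  DATE = as.IDate(c("31/12/2016", "30/06/2017", "31/12/2016"), format = fmt)
)
time <- data.table(
  DATE = as.IDate(c("31/12/2016", "30/06/2017", "31/12/2017", "30/06/2018"), format = fmt)
)
cols <- c("T1", "T2", "T3")

# Assumption: an ID is "stale" if its last record is older than the
# latest record date observed in `input`
obs_end <- input[, max(DATE)]
stale   <- input[, .(last_date = max(DATE)), by = ID][last_date < obs_end, ID]

# Every ID x every date; combinations absent from `input` get NA in T1/T2/T3
expanded <- input[CJ(ID = unique(input$ID), DATE = time$DATE, sorted = FALSE),
                  on = .(ID, DATE)]

# Stale IDs: missing values become 0
expanded[ID %in% stale, (cols) := lapply(.SD, nafill, fill = 0L), .SDcols = cols]
# Other IDs: carry the last observed value forward within each ID
expanded[!ID %in% stale, (cols) := lapply(.SD, nafill, type = "locf"),
         by = ID, .SDcols = cols]
```

With this rule ACC002 keeps (0, 1, 1) only on 31/12/2016 and gets zeros afterwards, while ACC001 carries (1, 0, 1) forward. GDP and CPI could then be attached with the same update join as before.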

Recommended answer

Using:

input[, .SD[time, on = "DATE"], by = ID
      ][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
        ][, (2:4) := lapply(.SD, zoo::na.locf), by = ID, .SDcols = 2:4][]

gives:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 31/12/2016 1.05  0.02
2: ACC001  1  0  1 30/06/2017 1.06  0.00
3: ACC001  1  0  1 31/12/2017 1.07 -0.01
4: ACC001  1  0  1 30/06/2018 1.08  0.01
5: ACC002  0  1  1 31/12/2016 1.05  0.02
6: ACC002  0  1  1 30/06/2017 1.06  0.00
7: ACC002  0  1  1 31/12/2017 1.07 -0.01
8: ACC002  0  1  1 30/06/2018 1.08  0.01

What this does:

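  • input[, .SD[time, on = "DATE"], by = ID] joins the time data.table to the remaining columns for each ID, and thus extends the data.table.
  • Then the wide version of mevs (dcast(mevs, DATE ~ INDEX_NAME)) is joined to the extended data.table.
  • Finally, the missing values in the extended data.table are filled by carrying the last observation forward within each ID with zoo::na.locf.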
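To make the three steps easier to follow, here is the same chain split into separate statements on the question's toy data. It is a sketch rather than the answer's exact code: the intermediate name expanded is made up, value.var is spelled out in dcast, and data.table's nafill(type = "locf") is swapped in for zoo::na.locf so that no extra package is needed.

```r
library(data.table)

input <- data.table(
  ID   = c("ACC001", "ACC001", "ACC002", "ACC002"),
  T1   = c(1L, 1L, 0L, 0L),
  T2   = c(0L, 0L, 1L, 1L),
  T3   = c(0L, 1L, 1L, 1L),
  DATE = c("31/12/2016", "30/06/2017", "31/12/2016", "30/06/2017")
)
mevs <- data.table(
  DATE        = rep(c("31/12/2016", "30/06/2017", "31/12/2017", "30/06/2018"), 2),
  INDEX_NAME  = rep(c("GDP", "CPI"), each = 4),
  INDEX_VALUE = c(1.05, 1.06, 1.07, 1.08, 0.02, 0.00, -0.01, 0.01)
)
time <- data.table(DATE = c("31/12/2016", "30/06/2017", "31/12/2017", "30/06/2018"))

# Step 1: within each ID, join `time` to that ID's rows, so every ID gets
# all four dates; dates without a record come back with NA in T1/T2/T3
expanded <- input[, .SD[time, on = "DATE"], by = ID]

# Step 2: reshape mevs to wide and add GDP/CPI by reference
expanded[dcast(mevs, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE"),
         on = "DATE", `:=`(GDP = i.GDP, CPI = i.CPI)]

# Step 3: carry the last observed T1/T2/T3 forward within each ID
cols <- c("T1", "T2", "T3")
expanded[, (cols) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = cols]
```

Only step 1 needs a new object, because `:=` cannot add rows to an existing data.table; steps 2 and 3 then modify expanded in place.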