How to perform multiple row-wise operations that depend on previous rows using R data.table (if possible)

Question

I have the following data table:

dt <- fread("
  ID   | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5  |  1.2 |      |   A  
ID_002 |      |      |      |   A
ID_003 |      |      |      |   A
ID_004 |      |      |      |   A
ID_001 | 0.4  |  2.5 |      |   B
ID_002 |      |      |      |   B
ID_003 |      |      |      |   B
ID_004 |      |      |      |   B  
            ", 
            sep = "|",
            colClasses = c("character", "numeric", "numeric", "numeric", "character"))

I'm trying to perform some row-wise operations, some of which depend on data from previous rows. More specifically:

calc_EO_1 <- function(
  EO_1,
  EO_2
){
  EO_1 <- shift(EO_1, type = "lag") * shift(EO_2, type = "lag")
  return(EO_1)
}

calc_EO_2 <- function(
  EO_1,
  EO_2,
  EO_3
){
  EO_2 <- EO_1 * shift(EO_2, type = "lag") * shift(EO_3, type = "lag")
  return(EO_2)
}

calc_EO_3 <- function(
  EO_1,
  EO_2
){
  EO_3 <- EO_1 * EO_2
  return(EO_3)
}

The last one needs to be calculated starting from the first row, since it depends only on the other fields of the same row (that should be easy). After that, all three operations have to be applied consecutively, row by row.

The closest I've gotten is:

first_row_bygroup_index <- dt[, .I[1], by = GROUP]$V1

dt[first_row_bygroup_index, 
   EO_3 := calc_EO_3(EO_1, EO_2)
     ]

dt[!first_row_bygroup_index, 
   `:=` (
     EO_1 = calc_EO_1(EO_1, EO_2),
     EO_2 = calc_EO_2(EO_1, EO_2, EO_3),
     EO_3 = calc_EO_3(EO_1, EO_2)
     ),
   by = row.names(dt[!first_row_bygroup_index])]

but it only calculates the first row of each group properly:

  ID   | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5  |  1.2 |  0.6 |   A  
ID_002 |      |      |      |   A
ID_003 |      |      |      |   A
ID_004 |      |      |      |   A
ID_001 | 0.4  |  2.5 |  1.0 |   B
ID_002 |      |      |      |   B
ID_003 |      |      |      |   B
ID_004 |      |      |      |   B  

with the blank cells being NAs.

I don't think I'm too far from the solution, but I can't find a way to make it work. The problem is that I can't perform operations on a subset of rows using rows from outside that subset.

EDIT: I left out the expected result:

  ID   |   EO_1      |     EO_2      |       EO_3      | GROUP
ID_001 |  0.50000000 |   1.20000000  |      0.60000000 |   A  
ID_002 |  0.60000000 |   0.43200000  |      0.25920000 |   A
ID_003 |  0.25920000 |   0.02902376  |      0.00752296 |   A
ID_004 |  0.00752296 |   0.00000164  |      0.00000001 |   A
ID_001 |  0.40000000 |   2.50000000  |      1.00000000 |   B
ID_002 |  1.00000000 |   2.50000000  |      2.50000000 |   B
ID_003 |  2.50000000 |  15.62500000  |     39.06250000 |   B
ID_004 | 39.06250000 | 23841.8580000 | 931322.57810000 |   B   

NEW EDIT: I came up with the following snippet, but I would rather wait a bit to see whether someone can come up with a more efficient solution than this one:

while(any(is.na(dt))){
  dt[, `:=` (
    EO_3 = calc_EO_3(EO_1, EO_2),
    EO_1 = ifelse(ID == "ID_001", EO_1, calc_EO_1(EO_1, EO_2)),
    EO_2 = ifelse(ID == "ID_001", EO_2, calc_EO_2(EO_1, EO_2, EO_3))
  )]  
}

I've come up with a similar dplyr solution, which relies on the same ugly while-loop fix. The key would be to find a way to make a row-wise calculation that can get information from the previous row, even when that row is outside the selected subset. I hope someone can improve on this, so I'll wait a little before marking it as a solution.

Recommended answer
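
Before the approach itself, a small illustration of why the attempt in the question leaves NAs behind. This is a sketch on a toy table (the names `x`, `g` and `v` are made up for the example, not taken from the question): `shift()` only sees the rows that are passed to it, so subsetting in `i` or grouping row by row hides the previous row.

```r
library(data.table)

# Toy data, hypothetical and unrelated to the question's columns
x <- data.table(g = c("a", "a", "a"), v = c(1, 2, 3))

# On the whole column, the lag sees the previous row
whole <- x[, shift(v, type = "lag")]

# With a subset in i, shift() only sees the rows inside the subset
subset_only <- x[-1L, shift(v, type = "lag")]

# Grouping by single rows leaves each group with nothing to lag from
per_row <- x[, shift(v, type = "lag"), by = seq_len(nrow(x))]$V1

print(whole)        # NA 1 2
print(subset_only)  # NA 2
print(per_row)      # NA NA NA
```

The last case corresponds to grouping by row names as in the question: every group has exactly one row, so every lag is NA.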

A possible approach:

dt[!is.na(EO_1), EO_3 := EO_1 * EO_2, by=.(GROUP)]
dt[ID!="ID_001", c("EO_1", "EO_2", "EO_3") :=
    dt[,
        {
            eo1 <- EO_1[1L]; eo2 <- EO_2[1L]; eo3 <- EO_3[1L]
            .SD[ID!="ID_001",
                {
                    eo1 <- eo1 * eo2
                    eo2 <- eo1 * eo2 * eo3
                    eo3 <- eo1 * eo2
                    .(eo1, eo2, eo3)
                },
                by=.(ID)]
        },
        by=.(GROUP)][, -1L:-2L]
]

Output:

       ID        EO_1         EO_2         EO_3 GROUP
1: ID_001  0.50000000 1.200000e+00 6.000000e-01     A
2: ID_002  0.60000000 4.320000e-01 2.592000e-01     A
3: ID_003  0.25920000 2.902376e-02 7.522960e-03     A
4: ID_004  0.00752296 1.642598e-06 1.235720e-08     A
5: ID_001  0.40000000 2.500000e+00 1.000000e+00     B
6: ID_002  1.00000000 2.500000e+00 2.500000e+00     B
7: ID_003  2.50000000 1.562500e+01 3.906250e+01     B
8: ID_004 39.06250000 2.384186e+04 9.313226e+05     B
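
Since each row depends on the row just computed, the recurrence can also be written as an explicit loop over the rows of each group. The following is only a sketch of that alternative, not the method used in the answer above; `fill_group` is a hypothetical helper introduced here, and the table is rebuilt with `data.table()` rather than `fread()` purely to keep the example self-contained.

```r
library(data.table)

# Same starting data as in the question: only the first row of each
# GROUP has EO_1 and EO_2 filled in; everything else is NA.
dt <- data.table(
  ID    = rep(sprintf("ID_%03d", 1:4), times = 2),
  EO_1  = c(0.5, NA, NA, NA, 0.4, NA, NA, NA),
  EO_2  = c(1.2, NA, NA, NA, 2.5, NA, NA, NA),
  EO_3  = NA_real_,
  GROUP = rep(c("A", "B"), each = 4)
)

# Hypothetical helper: fill one group's columns sequentially, each row
# using the values already computed for the row before it.
fill_group <- function(eo1, eo2) {
  n <- length(eo1)
  eo3 <- numeric(n)
  eo3[1L] <- eo1[1L] * eo2[1L]                     # calc_EO_3 on the first row
  for (i in seq_len(n)[-1L]) {
    eo1[i] <- eo1[i - 1L] * eo2[i - 1L]            # calc_EO_1
    eo2[i] <- eo1[i] * eo2[i - 1L] * eo3[i - 1L]   # calc_EO_2
    eo3[i] <- eo1[i] * eo2[i]                      # calc_EO_3
  }
  list(eo1, eo2, eo3)
}

# Apply the helper once per group and write all three columns by reference
dt[, c("EO_1", "EO_2", "EO_3") := fill_group(EO_1, EO_2), by = GROUP]
print(dt)
```

This version assumes the rows of each group are already in the order in which they should be processed, and that the first row of each group holds the starting values of EO_1 and EO_2.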
