data.table:子集并查找每一行的累积乘积 [英] data.table: Subset and find cumulative product for each row

查看:52
本文介绍了data.table:子集并查找每一行的累积乘积的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含三列的简单数据框:一个id,一个日期和一个值.现在,我要根据此过程中的以下三列来计算新值newValue:

I have a simple dataframe containing three columns: An id, a date and a value. Now, I want to calculate a new value, newValue, based on these three columns following this procedure:

  1. 针对每一行(即针对每一对(id,日期))
  2. 对于范围(日期,日期+2)中的所有日期,我想查找该id的值的累积乘积(然后减去1)

下面的简单数字示例由虚假数字进行计算:

The simple example below with made-up numbers does the computation:

df <- data.frame("id"=rep(1:10, 5),
                 "date"=c(rep(2000, 10), rep(2001, 10), rep(2002, 10), rep(2003, 10), rep(2004, 10)),
                 "value"=c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10), rep(5, 10)))

df$newValue <- 1 #initialize

for(idx in 1:dim(df)[1]) {
  id <- df[idx, "id"]
  lower <- df[idx, "date"]
  upper <- lower + 3
  
  df[idx, "newValue"] <- prod(df[(df$id == id) & (df$date >= lower) & (df$date < upper), ]$value + 1) - 1
}

这给了我输出(为简单起见,我已对其进行了注释):

This gives me the output (I have annotated it for simplicity):

   id date value newValue
1   1 2000     1       23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
2   2 2000     1       23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
....
12  2 2001     2       59 (= (2+1) * (3+1) * (4+1) - 1 = 59)
....
22  2 2002     3      119 (= (3+1) * (4+1) * (5+1) - 1 = 119)
....

但是,我的最终数据帧有+1百万行,因此上面的代码非常耗时且效率低.

However, my final dataframe has +1million rows, so the code above is very time-consuming and inefficient.

是否有一种方法可以加快速度,也许使用data.table?请注意,每个id可能具有不同数量的行,因此我为什么要显式地设置子集.

Is there a way to speed it up, perhaps using a data.table? Note that each id may have a different number of rows, so that I why I explicitly subset.

推荐答案

library(data.table)
library(purrr)

setDT(df)[, newValue := map_dbl(date, ~prod(value[between(date, .x, .x + 2)] + 1) - 1), by = id]

给出(仅显示 id = 1 ):

     id date value newValue
 1:  1 2000     1       23
 2:  1 2001     2       59
 3:  1 2002     3      119
 4:  1 2003     4       29
 5:  1 2004     5        5

更新:

因为每个 id 中的每个 date 最多一次,所以应该更有效:

because every date is at most once in each id this should be more efficient:

df <- setDT(df)[order(id, date)]

df[, 
  newValue := map2_dbl(
    date, map(seq_len(.N), ~.x:min(.x+2, .N)), 
    ~prod(value[.y][between(date[.y], .x, .x + 2)] + 1) - 1
  ), 
  by = id
]

如果您想要除 2 以外的其他数字,则可以创建一些varialbe date_range 并将 2 替换为 date_range

if you want some other number than 2 you can create some varialbe date_range and replace 2 with date_range

这篇关于data.table:子集并查找每一行的累积乘积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆