data.table:子集并查找每一行的累积乘积 [英] data.table: Subset and find cumulative product for each row
问题描述
我有一个包含三列的简单数据框:一个id,一个日期和一个值.现在,我要根据此过程中的以下三列来计算新值newValue:
I have a simple dataframe containing three columns: An id, a date and a value. Now, I want to calculate a new value, newValue, based on these three columns following this procedure:
- 针对每一行(即针对每一对(id,日期))
- 对于范围(日期,日期+2)中的所有日期,我想查找该id的值的累积乘积(然后减去1)
下面的简单数字示例由虚假数字进行计算:
The simple example below with made-up numbers does the computation:
df <- data.frame("id"=rep(1:10, 5),
"date"=c(rep(2000, 10), rep(2001, 10), rep(2002, 10), rep(2003, 10), rep(2004, 10)),
"value"=c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10), rep(5, 10)))
df$newValue <- 1 #initialize
for(idx in 1:dim(df)[1]) {
id <- df[idx, "id"]
lower <- df[idx, "date"]
upper <- lower + 3
df[idx, "newValue"] <- prod(df[(df$id == id) & (df$date >= lower) & (df$date < upper), ]$value + 1) - 1
}
这给了我输出(为简单起见,我已对其进行了注释):
This gives me the output (I have annotated it for simplicity):
id date value newValue
1 1 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
2 2 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
....
12 2 2001 2 59 (= (2+1) * (3+1) * (4+1) - 1 = 59)
....
22 2 2002 3 119 (= (3+1) * (4+1) * (5+1) - 1 = 119)
....
但是,我的最终数据帧有+1百万行,因此上面的代码非常耗时且效率低.
However, my final dataframe has +1million rows, so the code above is very time-consuming and inefficient.
是否有一种方法可以加快速度,也许使用data.table?请注意,每个id可能具有不同数量的行,因此我为什么要显式地设置子集.
Is there a way to speed it up, perhaps using a data.table? Note that each id may have a different number of rows, so that I why I explicitly subset.
推荐答案
library(data.table)
library(purrr)
setDT(df)[, newValue := map_dbl(date, ~prod(value[between(date, .x, .x + 2)] + 1) - 1), by = id]
给出(仅显示 id = 1
):
id date value newValue
1: 1 2000 1 23
2: 1 2001 2 59
3: 1 2002 3 119
4: 1 2003 4 29
5: 1 2004 5 5
更新:
因为每个 id
中的每个 date
最多一次,所以应该更有效:
because every date
is at most once in each id
this should be more efficient:
df <- setDT(df)[order(id, date)]
df[,
newValue := map2_dbl(
date, map(seq_len(.N), ~.x:min(.x+2, .N)),
~prod(value[.y][between(date[.y], .x, .x + 2)] + 1) - 1
),
by = id
]
如果您想要除 2
以外的其他数字,则可以创建一些varialbe date_range
并将 2
替换为 date_range
if you want some other number than 2
you can create some varialbe date_range
and replace 2
with date_range
这篇关于data.table:子集并查找每一行的累积乘积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!