How to write a cumulative calculation in data.table

Problem description

I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.

A simplified example of the calculation looks like this, for each time step (row) i:

 v[i] <- a[i] + b[i] * v[i-1]

a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1) the initial value of v is taken as v0 = 0.

My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is

library(data.table)        # version 1.9.7
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321) )
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321

DT[, v := NA]   # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4

This doesn't work, because shift(v) gives a copy of the original column v, shifted by 1 row. It is unaffected by assignment to v.

I also considered building the equation using cumsum() and cumprod(), but that won't work either.

So I resort to a for loop inside a function for convenience:

vcalc <- function(a, b, v0 = 0) {
  v <- numeric(length(a))      # preallocate the result vector
  for (i in seq_along(a)) {    # seq_along() is safe for zero-length input
    v[i] <- a[i] + b[i] * ifelse(i==1, v0, v[i-1])
  }
  return(v)
}

This cumulative function works fine with data.table:

DT[, v := vcalc(a, b, 0)][]
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
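
Since the real dataset has groups (basins and scenarios), the same function can be applied per group with `by`. The following is a minimal sketch under assumptions: the `basin` column and its values are made up for illustration, and rows are assumed to be already sorted by time within each group.

```r
library(data.table)

# Same recurrence as vcalc() above: v[i] = a[i] + b[i] * v[i-1], starting from v0
vcalc <- function(a, b, v0 = 0) {
  v <- numeric(length(a))
  for (i in seq_along(a)) {
    v[i] <- a[i] + b[i] * ifelse(i == 1, v0, v[i - 1])
  }
  v
}

# Hypothetical data: two basins with three time steps each
DT <- data.table(basin = rep(c("A", "B"), each = 3),
                 a     = c(1, 2, 3, 1, 2, 3),
                 b     = 0.1)

# The recurrence restarts from v0 = 0 at the first row of every basin
DT[, v := vcalc(a, b, 0), by = basin]
```

Each group is passed to the function as its own pair of vectors, so the state from one basin does not leak into the next.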


My question

My question is, can I write this calculation in a more concise and efficient data.table way, without having to use the for loop and/or function definition? Using set() perhaps?

Or is there a better approach altogether?

David's Rcpp solution below inspired me to remove the ifelse() from the for loop:

vcalc2 <- function(a, b, v0 = 0) {
  v <- numeric(length(a))      # preallocate the result vector
  for (i in seq_along(a)) {    # seq_along() is safe for zero-length input
    v0 <- v[i] <- a[i] + b[i] * v0
  }
  return(v)
}

vcalc2() is about 60% faster than vcalc().

Recommended answer

It may not be 100% what you are looking for, as it does not use the "data.table-way" and still uses a for-loop. However, this approach should be faster (I assume you want to use data.table and the data.table-way to speed up your code). I leverage Rcpp to write a short function called HydroFun, that can be used in R like any other function (you just need to source the function first). My gut-feeling tells me that the data.table way (if existent) is pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).
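
On the closed-form point: for this particular linear recurrence one does exist, provided every b is nonzero. Dividing through by the running product P[i] = b[1] * ... * b[i] turns the recurrence into a plain cumulative sum, so v[i] = P[i] * (v0 + sum over j <= i of a[j] / P[j]). The following is only a sketch: vcalc_closed is a name made up here, and b must already be a full-length vector rather than a recycled scalar. For long series cumprod(b) can underflow or overflow, so the loop and Rcpp versions are likely the safer choice in practice.

```r
# Closed-form sketch of v[i] = a[i] + b[i] * v[i-1], assuming all b != 0
vcalc_closed <- function(a, b, v0 = 0) {
  stopifnot(length(a) == length(b), all(b != 0))
  P <- cumprod(b)              # running product b[1] * ... * b[i]
  P * (v0 + cumsum(a / P))     # rescale the cumulative sum back to v
}

all.equal(vcalc_closed(1:4, rep(0.1, 4)), c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```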

My approach is as follows:

The Rcpp function looks like this (in the file: hydrofun.cpp):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
  // get the size of the vectors
  int vecSize = a.length();

  // initialize a numeric vector "v" (for the result)
  NumericVector v(vecSize);

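  // nothing to compute for empty input (avoids reading a[0] out of bounds)
  if (vecSize == 0) return v;

  // compute the first value from the initial condition v0
  v[0] = a[0] + b[0] * v0;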

  // loop through the vector and compute the new value
  for (int i = 1; i < vecSize; ++i) {
    v[i] = a[i] + b[i] * v[i - 1];
  }
  return v;
}

To source and use the function in R you can do:

Rcpp::sourceCpp("hydrofun.cpp")

library(data.table)
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))

DT[, v_ans2 := HydroFun(a, b, 0)]
DT
#    a   b v.ans v_ans2
# 1: 1 0.1 1.000  1.000
# 2: 2 0.1 2.100  2.100
# 3: 3 0.1 3.210  3.210
# 4: 4 0.1 4.321  4.321

Comparing the speeds reveals a speed-up of roughly 65x.

library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
                 b = rnorm(n))

microbenchmark(dt[, v1 := vcalc(a, b, 0)],
               dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
# expr                                min        lq       mean    median         uq       max neval
# dt[, `:=`(v1, vcalc(a, b, 0))]    28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
# dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100

identical(dt$v1, dt$v2)
# [1] TRUE
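
Returning to the question of avoiding an explicit for loop: base R's Reduce() with accumulate = TRUE carries the previous result forward in the same way. This is a sketch rather than a recommendation. The name vcalc_reduce is made up here, and because it calls an R closure once per row it is unlikely to be faster than the plain loop, let alone the Rcpp version.

```r
# Loop-free sketch of v[i] = a[i] + b[i] * v[i-1] using Reduce()
vcalc_reduce <- function(a, b, v0 = 0) {
  out <- Reduce(function(v_prev, i) a[i] + b[i] * v_prev,
                seq_along(a), init = v0, accumulate = TRUE)
  unlist(out)[-1]   # drop the leading v0 that accumulate = TRUE keeps
}

all.equal(vcalc_reduce(1:4, rep(0.1, 4)), c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```

It can be used inside data.table in the same way as the other versions, for example DT[, v := vcalc_reduce(a, b, 0)].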

Does that help you in any way?
