Fastest and most efficient way to pass rows of a large data.frame as arguments to a function in R


Question

I would like to add a new column to a data.frame myDF that holds, for each row, the value returned by a function getval taking the elements of that row as arguments. getval also takes an external vector v1 as an argument. For example:

myn = 1000
a = seq(0, 1, length.out = myn)
b = seq(-1, 1, length.out = myn)
myDF = expand.grid(a=a, b=b)

set.seed(13)
v1 = rnorm(100)

getval = function(a, b, v) {
  return(sum(a*v + b/2*v))
}

myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
#             a  b      val
# 1 0.000000000 -1 3.091267
# 2 0.001001001 -1 3.085078
# 3 0.002002002 -1 3.078889
# 4 0.003003003 -1 3.072700
# 5 0.004004004 -1 3.066512
# 6 0.005005005 -1 3.060323

But this is too slow (about 4 seconds here, and it grows quickly for larger myn, since myDF has myn^2 rows).

I am looking for the fastest way to implement this. Contest! ;-)

All solutions (including parallelization?) and packages (dplyr, data.table?) are welcome. I really need something as fast as possible, for example for myn = 5000.

EDIT: Actually, getval is not so (easily?) vectorizable...

getval = function(a, b, v) {
  return(sum(a/(a/v +1) + b/(b+2) * v))
}

myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
#             a  b      val
# 1 0.000000000 -1 6.182533
# 2 0.001001001 -1 6.282782
# 3 0.002002002 -1 6.383424
# 4 0.003003003 -1 6.484682
# 5 0.004004004 -1 6.586980
# 6 0.005005005 -1 6.691260

Recommended answer

You should try as hard as possible to avoid looping over rows. For your first example, a and b are scalars within each row, so they factor out of the sum and getval can take whole columns at once:

getval = function(a, b, v) {
  return((a + b / 2) *sum(v))
}

myDF$val1 = getval(myDF$a, myDF$b, v1)
head(myDF)
#            a  b      val     val1
#1 0.000000000 -1 3.091267 3.091267
#2 0.001001001 -1 3.085078 3.085078
#3 0.002002002 -1 3.078889 3.078889
#4 0.003003003 -1 3.072700 3.072700
#5 0.004004004 -1 3.066512 3.066512
#6 0.005005005 -1 3.060323 3.060323
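
As a quick sanity check of that rewrite, here is a minimal sketch on a small grid. The names small, row_wise and vectorized are made up for this illustration and are not part of the original code. It relies only on the identity sum(a*v + b/2*v) = (a + b/2) * sum(v) for scalar a and b.

```r
set.seed(13)
v1 <- rnorm(100)

# Small hypothetical grid so the row-wise version runs instantly
small <- expand.grid(a = seq(0, 1, length.out = 5),
                     b = seq(-1, 1, length.out = 5))

# Original definition: one call per row
row_wise <- function(a, b, v) sum(a * v + b / 2 * v)

# Rewritten definition: a and b factored out, works on whole columns
vectorized <- function(a, b, v) (a + b / 2) * sum(v)

by_row <- mapply(row_wise, small$a, small$b, MoreArgs = list(v = v1))
by_col <- vectorized(small$a, small$b, v1)

# Compare up to floating-point tolerance
all.equal(by_row, by_col)
```

The two results can differ in the last few bits because the additions happen in a different order, which is why the comparison uses all.equal rather than identical.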

You won't be able to beat the performance of such a vectorized solution. If vectorizing is not possible in R, implement everything (including the loop) with Rcpp. It's not difficult with such simple functions.

Here is an Rcpp function for your second example. It's quite simple because of Rcpp sugar functions such as sum.

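library(Rcpp)
cppFunction(
  "
  NumericVector rcpp_geval(const NumericVector a, const NumericVector b, const NumericVector v) {
    // One result per (a, b) pair; R_xlen_t is the native R index type.
    const R_xlen_t n = a.size();
    NumericVector res(n);
    for (R_xlen_t i = 0; i < n; ++i) {
      // Rcpp sugar: scalars a[i], b[i] combined with the whole vector v, then summed.
      res[i] = sum(a[i]/(a[i]/v + 1) + b[i]/(b[i] + 2) * v);
    }
    return res;
  }
  "
)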

myDF$val1 <- rcpp_geval(myDF$a, myDF$b, v1)

head(myDF)
#            a  b      val     val1
#1 0.000000000 -1 6.182533 6.182533
#2 0.001001001 -1 6.282782 6.282782
#3 0.002002002 -1 6.383424 6.383424
#4 0.003003003 -1 6.484682 6.484682
#5 0.004004004 -1 6.586980 6.586980
#6 0.005005005 -1 6.691260 6.691260
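
For this particular getval there may also be a pure-R route, because the sum splits into a part that depends only on a and a part that depends only on b: sum(a/(a/v + 1)) + b/(b + 2) * sum(v). The sketch below is not from the original answer. It assumes the data really is a full grid built with expand.grid, and the names fa, gb and val2 are made up for the illustration. myn is reduced so the row-wise reference runs instantly.

```r
# Reduced stand-in for the data in the question
myn <- 50
a <- seq(0, 1, length.out = myn)
b <- seq(-1, 1, length.out = myn)
myDF <- expand.grid(a = a, b = b)

set.seed(13)
v1 <- rnorm(100)

getval <- function(a, b, v) sum(a / (a / v + 1) + b / (b + 2) * v)

# Reference: the row-wise apply() from the question
ref <- apply(myDF, 1, function(x) getval(a = x[1], b = x[2], v = v1))

# a-part: evaluate once per unique a value (myn x length(v1) matrix, then row sums)
fa <- rowSums(outer(a, v1, function(a, v) a / (a / v + 1)))

# b-part: b is a scalar per row, so it factors out of the sum
gb <- b / (b + 2) * sum(v1)

# Look up each row's two parts by the grid value it was built from
myDF$val2 <- fa[match(myDF$a, a)] + gb[match(myDF$b, b)]

# Compare with the row-wise result up to floating-point tolerance
all.equal(unname(ref), myDF$val2)
```

The heavy work then scales with myn * length(v1) instead of myn^2 * length(v1). If the rows are not a full grid, or getval does not split this way, the Rcpp loop above remains the more general option.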
