Fastest and most efficient way to pass rows of a large data.frame as arguments to a function in R
Question
I would like to build a new column in a data.frame myDF that holds, for each row, the value returned by a function getval taking the elements of that row as arguments. getval also takes an external vector v1 as an argument. For example:
myn = 1000
a = seq(0, 1, length.out = myn)
b = seq(-1, 1, length.out = myn)
myDF = expand.grid(a=a, b=b)
set.seed(13)
v1 = rnorm(100)
getval = function(a, b, v) {
return(sum(a*v + b/2*v))
}
myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
# a b val
# 1 0.000000000 -1 3.091267
# 2 0.001001001 -1 3.085078
# 3 0.002002002 -1 3.078889
# 4 0.003003003 -1 3.072700
# 5 0.004004004 -1 3.066512
# 6 0.005005005 -1 3.060323
But this is too slow (about 4 seconds here, and much longer for higher myn).
I am looking for the fastest way to implement this. Contest! ;-)
All solutions (incl. parallelizing?) and packages (dplyr, data.table?) are welcome. I really need something as fast as possible, for example for myn = 5000.
EDIT
Actually, getval is not so (easily?) vectorizable...
getval = function(a, b, v) {
return(sum(a/(a/v +1) + b/(b+2) * v))
}
myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
# a b val
# 1 0.000000000 -1 6.182533
# 2 0.001001001 -1 6.282782
# 3 0.002002002 -1 6.383424
# 4 0.003003003 -1 6.484682
# 5 0.004004004 -1 6.586980
# 6 0.005005005 -1 6.691260
Recommended answer
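You should try as hard as possible to avoid looping over rows. For your first example, the factor (a + b / 2) does not depend on v, so it can be pulled out of the sum and the function applied to whole columns at once: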
getval = function(a, b, v) {
return((a + b / 2) *sum(v))
}
myDF$val1 = getval(myDF$a, myDF$b, v1)
head(myDF)
# a b val val1
#1 0.000000000 -1 3.091267 3.091267
#2 0.001001001 -1 3.085078 3.085078
#3 0.002002002 -1 3.078889 3.078889
#4 0.003003003 -1 3.072700 3.072700
#5 0.004004004 -1 3.066512 3.066512
#6 0.005005005 -1 3.060323 3.060323
You won't be able to beat the performance of such a vectorized solution. If vectorizing is not possible in R, implement everything (including the loop) with Rcpp. It's not difficult with such simple functions.
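The rewrite above relies on the identity sum(a*v + b/2*v) = (a + b/2) * sum(v). As a minimal sketch, the two forms can be compared on a small grid; the names getval_rowwise, getval_vec and small are hypothetical and chosen only for this illustration.

```r
# Row-wise definition, as in the question
getval_rowwise <- function(a, b, v) sum(a * v + b / 2 * v)

# Vectorized form: (a + b/2) is constant with respect to v and factors out
getval_vec <- function(a, b, v) (a + b / 2) * sum(v)

set.seed(13)
v1 <- rnorm(100)

# Small illustrative grid (20 x 20 rows) so the row-wise version stays cheap
small <- expand.grid(a = seq(0, 1, length.out = 20),
                     b = seq(-1, 1, length.out = 20))

# Reference: call the row-wise function once per row
val_rowwise <- mapply(getval_rowwise, small$a, small$b,
                      MoreArgs = list(v = v1))

# Vectorized: a single call on whole columns
val_vec <- getval_vec(small$a, small$b, v1)

# Compare up to floating-point tolerance
all.equal(val_rowwise, val_vec)
```

The comparison uses all.equal rather than identical because the two forms add the terms in a different order, so tiny rounding differences are expected.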
Here is an Rcpp function for your second example. It's quite simple because of Rcpp sugar functions such as sum.
library(Rcpp)
cppFunction(
"
NumericVector rcpp_geval(const NumericVector a, const NumericVector b, const NumericVector v) {
// use an integer index type (not double) for the length and loop counter
const R_xlen_t n = a.size();
NumericVector res(n);
for (R_xlen_t i = 0; i < n; ++i) {
res[i] = sum(a[i]/(a[i]/v +1) + b[i]/(b[i]+2) * v);
}
return res;
}
"
)
myDF$val1 <- rcpp_geval(myDF$a, myDF$b, v1)
head(myDF)
# a b val val1
#1 0.000000000 -1 6.182533 6.182533
#2 0.001001001 -1 6.282782 6.282782
#3 0.002002002 -1 6.383424 6.383424
#4 0.003003003 -1 6.484682 6.484682
#5 0.004004004 -1 6.586980 6.586980
#6 0.005005005 -1 6.691260 6.691260
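If compiling C++ is not an option, one base R alternative (not part of the answer above, and not benchmarked here) is to loop over the short vector v instead of over the many rows, so that each iteration is vectorized across all rows. This is a sketch only; getval2_rowwise, getval2_by_v and small are hypothetical names introduced for illustration.

```r
# Second getval from the question, evaluated one row at a time (reference)
getval2_rowwise <- function(a, b, v) sum(a / (a / v + 1) + b / (b + 2) * v)

# Loop over the elements of v; each step works on all rows at once
getval2_by_v <- function(a, b, v) {
  res <- numeric(length(a))
  for (vk in v) {
    res <- res + a / (a / vk + 1) + b / (b + 2) * vk
  }
  res
}

set.seed(13)
v1 <- rnorm(100)

# Small illustrative grid so the reference computation stays cheap
small <- expand.grid(a = seq(0, 1, length.out = 20),
                     b = seq(-1, 1, length.out = 20))

val2_rowwise <- mapply(getval2_rowwise, small$a, small$b,
                       MoreArgs = list(v = v1))
val2_by_v <- getval2_by_v(small$a, small$b, v1)

# Compare up to floating-point tolerance
all.equal(val2_rowwise, val2_by_v)
```

The loop runs length(v) times (100 here) regardless of the number of rows, which is why it may scale better than apply over rows; whether it approaches the Rcpp version would need to be measured on the real data.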