Efficiently computing a linear combination of data.table columns
Question
I have nc columns in a data.table, and nc scalars in a vector. I want to take a linear combination of the columns, but I don't know ahead of time which columns I will be using. What is the most efficient way to do this?
require(data.table)
set.seed(1)
n <- 1e5
nc <- 5
cf <- setNames(rnorm(nc),LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc,rnorm(n))),LETTERS[1:nc])
Ways to do it
Suppose I want to use the first four columns. I can write it out manually:
DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)]
I can think of two automatic ways (they work without knowing in advance which of the columns A-E will be used):
mycols <- LETTERS[1:4] # the first four columns
DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols]
DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
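Before timing the three approaches, it may be worth confirming that they return the same values. The following is a minimal sketch of such a check, not part of the original question: it uses a smaller n to keep it quick, returns plain vectors from j instead of one-column tables, and the names same_coerce and same_maprdc are introduced here only for illustration.

```r
library(data.table)

set.seed(1)
n  <- 1e3   # assumption: smaller than the original 1e5, only to keep the check quick
nc <- 5
cf <- setNames(rnorm(nc), LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])
mycols <- LETTERS[1:4]

# Same three expressions as above, returned as plain vectors
manual <- DT[, cf['A']*A + cf['B']*B + cf['C']*C + cf['D']*D]
coerce <- DT[, as.vector(as.matrix(.SD) %*% cf[mycols]), .SDcols = mycols]
maprdc <- DT[, Reduce(`+`, Map(`*`, cf[mycols], .SD)), .SDcols = mycols]

# Compare values only; as.vector() drops any names or dimensions
same_coerce <- isTRUE(all.equal(as.vector(manual), as.vector(coerce)))
same_maprdc <- isTRUE(all.equal(as.vector(manual), as.vector(maprdc)))
```

Both flags should come out TRUE up to floating-point tolerance, since the three expressions compute the same weighted sum in a different order.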
Benchmarking
I expect the as.matrix coercion to make the second option slow, and I really have no intuition for the speed of Map-Reduce combinations.
require(rbenchmark)
options(datatable.verbose=FALSE) # in case you have it turned on
benchmark(
manual=DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
coerce=DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols],
maprdc=DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
)[,1:6]
    test replications elapsed relative user.self sys.self
2 coerce          100    2.47    1.342      1.95     0.51
1 manual          100    1.84    1.000      1.53     0.31
3 maprdc          100    2.40    1.304      1.62     0.75
When I repeat the benchmark call, the automatic approaches come out anywhere from 5% to 40% slower than the manual approach.
The dimensions here, n and length(mycols), are close to what I am working with, but I will be running these computations many times, altering the coefficient vector, cf.
Recommended answer
This is almost 2x faster for me than your manual version:
Reduce("+", lapply(names(DT), function(x) DT[[x]] * cf[x]))
benchmark(manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
reduce = Reduce('+', lapply(names(DT), function(x) DT[[x]] * cf[x])))
#    test replications elapsed relative user.self sys.self user.child sys.child
#1 manual          100    1.43    1.744      1.08     0.36         NA        NA
#2 reduce          100    0.82    1.000      0.58     0.24         NA        NA
The code above uses every column of DT. To iterate over just mycols, replace names(DT) with mycols in lapply.
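Since the question mentions running the computation many times with different coefficient vectors, the Reduce approach could be wrapped in a small function. This is only a sketch under assumptions: lin_comb, cf_list and results are hypothetical names that do not appear in the original answer, and n is reduced to keep the example quick.

```r
library(data.table)

# Hypothetical helper (not from the original answer): weighted sum of the
# columns named in `cols`, using the coefficients stored under the same names in `cf`.
lin_comb <- function(DT, cf, cols = names(cf)) {
  stopifnot(all(cols %in% names(DT)), all(cols %in% names(cf)))
  # cf[[x]] extracts an unnamed scalar, DT[[x]] the column as a plain vector
  Reduce(`+`, lapply(cols, function(x) DT[[x]] * cf[[x]]))
}

set.seed(1)
n  <- 1e3   # assumption: smaller than the original 1e5
nc <- 5
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])
mycols <- LETTERS[1:4]

# Re-use the same table with several coefficient vectors
cf_list <- replicate(3, setNames(rnorm(nc), LETTERS[1:nc]), simplify = FALSE)
results <- lapply(cf_list, function(cf) lin_comb(DT, cf, mycols))
```

Each element of results is a numeric vector of length n. Whether this keeps the speed advantage reported above would need to be benchmarked on your own data.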