Efficiently computing a linear combination of data.table columns


Question

I have nc columns in a data.table and nc scalars in a named vector. I want to take a linear combination of the columns, but I don't know ahead of time which columns I will be using. What is the most efficient way to do this?

require(data.table)
set.seed(1)

n  <- 1e5
nc <- 5
cf <- setNames(rnorm(nc),LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc,rnorm(n))),LETTERS[1:nc])

Ways to do it

Suppose I want to use the first four columns. I can write the expression out manually:

DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)]

I can think of two automatic ways (that work without knowing in advance which of the columns A-E will be used):

mycols <- LETTERS[1:4] # the first four columns
DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols]
DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
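
As a quick sanity check, the manual expression and the two automatic formulations can be compared on a small table. This is only a minimal sketch: it assumes the data.table package is installed, and the names small, res_manual, res_coerce and res_maprdc are illustrative and do not appear in the original post.

```r
library(data.table)
set.seed(1)

nc <- 5
cf <- setNames(rnorm(nc), LETTERS[1:nc])
# Small illustrative table (10 rows) with columns A..E
small <- setnames(data.table(replicate(nc, rnorm(10))), LETTERS[1:nc])
mycols <- LETTERS[1:4]  # the first four columns

# Manual expression, written out column by column
res_manual <- small[, cf['A']*A + cf['B']*B + cf['C']*C + cf['D']*D]
# Coerce the selected columns to a matrix and multiply by the coefficients
res_coerce <- small[, drop(as.matrix(.SD) %*% cf[mycols]), .SDcols = mycols]
# Scale each column by its coefficient, then sum the scaled columns
res_maprdc <- small[, Reduce(`+`, Map(`*`, cf[mycols], .SD)), .SDcols = mycols]

# Strip names and dimensions before comparing the numeric values
all.equal(as.vector(res_manual), as.vector(res_coerce))
all.equal(as.vector(res_manual), as.vector(res_maprdc))
```

Both comparisons should agree up to floating-point tolerance, since each line computes the same weighted sum of columns A to D.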

Benchmarking

I expect the as.matrix coercion to make the matrix-product option slow, and I really have no intuition for the speed of Map-Reduce combinations.

require(rbenchmark)
options(datatable.verbose=FALSE) # in case you have it turned on

benchmark(
    manual=DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
    coerce=DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols],
    maprdc=DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
)[,1:6]

    test replications elapsed relative user.self sys.self
2 coerce          100    2.47    1.342      1.95     0.51
1 manual          100    1.84    1.000      1.53     0.31
3 maprdc          100    2.40    1.304      1.62     0.75

When I repeat the benchmark call, the two automatic approaches come out anywhere from 5% to 40% slower than the manual one.

The dimensions here, n and length(mycols), are close to what I am working with, but I will be running these computations many times, altering the coefficient vector cf each time.

Recommended answer

This is almost 2x faster for me than your manual version:

Reduce("+", lapply(names(DT), function(x) DT[[x]] * cf[x]))

benchmark(manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
          reduce = Reduce('+', lapply(names(DT), function(x) DT[[x]] * cf[x])))
#    test replications elapsed relative user.self sys.self user.child sys.child
#1 manual          100    1.43    1.744      1.08     0.36         NA        NA
#2 reduce          100    0.82    1.000      0.58     0.24         NA        NA

To iterate over just mycols, replace names(DT) with mycols in the lapply call.
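
Because the computation will be repeated with different coefficient vectors, it may be convenient to wrap the idea in a small function. The sketch below is one possible way to do that, not part of the original answer: the name lincomb and its argument names are hypothetical, and it assumes cf is a numeric vector named after the columns, as in the question.

```r
library(data.table)

# Hypothetical helper (not from the original answer): weighted sum of the
# columns listed in `cols`, using the identically named entries of `cf`.
lincomb <- function(DT, cf, cols = names(DT)) {
  stopifnot(all(cols %in% names(DT)), all(cols %in% names(cf)))
  # DT[[x]] extracts one column as a plain vector; cf[[x]] is its scalar weight
  Reduce(`+`, lapply(cols, function(x) DT[[x]] * cf[[x]]))
}

set.seed(1)
cf <- setNames(rnorm(5), LETTERS[1:5])
DT <- setnames(data.table(replicate(5, rnorm(1e3))), LETTERS[1:5])
mycols <- LETTERS[1:4]

res <- lincomb(DT, cf, mycols)

# Cross-check against the matrix product used in the question
ref <- as.vector(as.matrix(DT[, mycols, with = FALSE]) %*% cf[mycols])
all.equal(res, ref)
```

Calling lincomb again with a new cf only redoes the column-wise multiplications and additions; the table is never converted to a matrix.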
