Efficiently computing a linear combination of data.table columns

Views: 102 | Posted: 2017/3/12 10:02:10 | Tags: r, performance, linear-algebra, data.table

Question

I have nc columns in a data.table, and nc scalars in a vector. I want to take a linear combination of the columns, but I don't know ahead of time which columns I will be using. What is the most efficient way to do this?

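library(data.table)
set.seed(1)

n  <- 1e5   # number of rows
nc <- 5     # number of columns
cf <- setNames(rnorm(nc), LETTERS[1:nc])   # named coefficient vector, one entry per column
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])   # columns A..E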

Ways to do it

Suppose I want to use the first four columns. I can manually write:

DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)]

I can think of two automatic ways (that work without knowing that A-E should all be used):

mycols <- LETTERS[1:4] # the first four columns
DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols]
DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
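
Before timing these, it may be worth confirming that the three expressions agree. The following is a minimal sketch on a smaller table than the one in the question (the row count of 1e3 is an assumption chosen only to keep the check quick); it drops the list() wrapper so that each call returns a plain vector or matrix.

```r
library(data.table)
set.seed(1)

n  <- 1e3   # assumed smaller size, just for a quick check
nc <- 5
cf <- setNames(rnorm(nc), LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc, rnorm(n))), LETTERS[1:nc])
mycols <- LETTERS[1:4]

manual <- DT[, cf['A']*A + cf['B']*B + cf['C']*C + cf['D']*D]
coerce <- DT[, as.matrix(.SD) %*% cf[mycols], .SDcols = mycols]
maprdc <- DT[, Reduce(`+`, Map(`*`, cf[mycols], .SD)), .SDcols = mycols]

# The matrix product yields an n x 1 matrix, so flatten it before comparing
all.equal(manual, as.vector(coerce))
all.equal(manual, maprdc)
```

all.equal compares within a small numeric tolerance, which is appropriate here because the three versions add the terms in slightly different ways.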

Benchmarking

I expect the as.matrix to make the second option slow, and really have no intuition for the speed of Map-Reduce combinations.

require(rbenchmark)
options(datatable.verbose=FALSE) # in case you have it turned on

benchmark(
    manual=DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
    coerce=DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols],
    maprdc=DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
)[,1:6]

    test replications elapsed relative user.self sys.self
2 coerce          100    2.47    1.342      1.95     0.51
1 manual          100    1.84    1.000      1.53     0.31
3 maprdc          100    2.40    1.304      1.62     0.75

I get anywhere from a 5% to 40% slowdown relative to the manual approach when I repeat the benchmark call.

The dimensions here -- n and length(mycols) -- are close to what I am working with, but I will be running these computations many times, altering the coefficient vector, cf.

Recommended answer

This is almost 2x faster for me than your manual version:

Reduce("+", lapply(names(DT), function(x) DT[[x]] * cf[x]))

benchmark(manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
          reduce = Reduce('+', lapply(names(DT), function(x) DT[[x]] * cf[x])))
#    test replications elapsed relative user.self sys.self user.child sys.child
#1 manual          100    1.43    1.744      1.08     0.36         NA        NA
#2 reduce          100    0.82    1.000      0.58     0.24         NA        NA

To iterate over only mycols, replace names(DT) with mycols in the lapply call.
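
Since the question mentions rerunning the computation with many different coefficient vectors, the Reduce/lapply idea could be wrapped in a small function. This is only a sketch: lincomb is a hypothetical helper name (not part of data.table), and the table size and the three random coefficient vectors are made up for illustration.

```r
library(data.table)

# lincomb() is a hypothetical helper name, not part of data.table.
# It multiplies each selected column by its coefficient and sums the results.
lincomb <- function(DT, cf, cols = names(cf)) {
  Reduce(`+`, lapply(cols, function(x) DT[[x]] * cf[[x]]))
}

set.seed(1)
n  <- 1e3   # assumed smaller size than the question's 1e5
DT <- setnames(data.table(replicate(5, rnorm(n))), LETTERS[1:5])
mycols <- LETTERS[1:4]

# Three made-up coefficient vectors standing in for "altering cf" many times
cf_list <- replicate(3, setNames(rnorm(5), LETTERS[1:5]), simplify = FALSE)
results <- lapply(cf_list, function(cf) lincomb(DT, cf, mycols))
```

Each element of results is a numeric vector of length n. Passing cols explicitly keeps the choice of columns separate from the coefficient vector, so the same cf can cover all columns while only a subset is combined.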
