Using data.table with multiple threads in R


Question

Is there a way to use multiple threads for computation with data.table in R? For example, say I have the following data.table:

library(data.table)
dtb <- data.table(id = rep(1:10000, 1000), x = 1:1e7)
setkey(dtb, id)
f <- function(m) {
    # some really complicated function
}
res <- dtb[, f(x), by = id]

Is there a way to get R to multithread this if f takes a while to compute? And in the case where f is quick, will multithreading help, or will most of the time be taken by data.table splitting things up into groups?

Recommended answer

I am not sure that this is "multi-threading", but perhaps you meant to include a multi-core solution? If so, look at this earlier answer, "Performing calculations by subsets of data in R", found with a search for "[r] [data.table] parallel".

(doubling of speed on a 4 core machine, but my system monitor suggests this only used 2 cores during the mclapply call.) Code copied from this thread: http://r.789695.n4.nabble.com/Access-to-local-variables-in-quot-j-quot-expressions-tt2315330.html#a2315337

 calc.fake.dt.mclapply <- function (dt) {
     mclapply(6*c(1000,1:4,6,8,10),
              function(critical.age) {
                  dt$tmp <-  pmax((dt$age <  critical.age) * dt$x, 0)
                  dt[, cumsum.lag(tmp), by = grp]$V1})
 } 
 mk.fake.df <- function (n.groups=10000, n.per.group=70) {
    data.frame(grp=rep(1:n.groups, each=n.per.group),
               age=rep(0:(n.per.group-1), n.groups),
               x=rnorm(n.groups * n.per.group),
               ## These don't do anything, but only exist to give
               ## the table a similar size to the real data.
               y1=rnorm(n.groups * n.per.group),
               y2=rnorm(n.groups * n.per.group),
               y3=rnorm(n.groups * n.per.group),
               y4=rnorm(n.groups * n.per.group)) } 
 df <- mk.fake.df()
 calc.fake.dt.lapply <- function (dt) { # use base lapply for testing
     lapply(6*c(1000,1:4,6,8,10),
            function(critical.age) {
                dt$tmp <-  pmax((dt$age <  critical.age) * dt$x, 0)
                dt[, cumsum.lag(tmp), by = grp]$V1})
 } 
 mk.fake.dt <- function (fake.df) {
    fake.dt <- as.data.table(fake.df)
    setkey(fake.dt, grp, age)
    fake.dt
  } 
 library(data.table)
 library(parallel)  # provides mclapply
 dt <- mk.fake.dt(df)

 cumsum.lag <- function (x) {
    x.prev <- c(0, x[-length(x)])
    cumsum(x.prev)
  } 
 system.time(res.dt.mclapply <- calc.fake.dt.mclapply(dt))
  user  system elapsed 
 1.896   4.413   1.210 

system.time(res.dt.lapply   <- calc.fake.dt.lapply(dt))
   user  system elapsed 
  1.391   0.793   2.175 
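
The same multi-core idea can be sketched for the table in the question: split the ids into chunks, run the grouped call on each chunk in a forked worker, and bind the pieces back together. This is only a minimal sketch under stated assumptions, not the approach from the linked thread: the table is shrunk so it runs quickly, f is a hypothetical stand-in for the expensive function, and n.cores, chunks, res.par and res.seq are names introduced here for illustration.

```r
library(data.table)
library(parallel)

# Smaller stand-in for the question's table so the sketch runs quickly.
dtb <- data.table(id = rep(1:100, 50), x = 1:5000)
setkey(dtb, id)

# Hypothetical placeholder for the expensive per-group function.
f <- function(m) sum(sqrt(m))

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Deal the ids out round-robin so each group lives in exactly one chunk.
ids <- unique(dtb$id)
chunks <- split(ids, rep_len(seq_len(n.cores), length(ids)))

# Each worker runs the ordinary grouped call on its own subset of ids.
res.list <- mclapply(chunks, function(chunk) {
    dtb[id %in% chunk, f(x), by = id]
}, mc.cores = n.cores)

# Combine the per-chunk results and restore id order.
res.par <- rbindlist(res.list)
setorder(res.par, id)

# Single-process version for comparison.
res.seq <- dtb[, f(x), by = id]
```

Whether this pays off depends on f. Forking the workers and recombining the results has a fixed cost, so when f is cheap the plain dtb[, f(x), by = id] call may well be faster; wrapping both versions in system.time on realistic data, as in the timings above, is the way to find out.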
