Using data.table with multiple threads in R
Question
Is there a way to use multiple threads for computation with data.table in R? For example, say I have the following data.table:
library(data.table)
dtb <- data.table(id=rep(1:10000, 1000), x=1:1e7)
setkey(dtb, id)
f <- function(m) {
  # some really complicated function
}
res <- dtb[,f(x), by=id]
Is there a way to get R to multithread this if f takes a while to compute? And if f is quick, will multithreading help, or will most of the time be spent by data.table splitting the data into groups?
Recommended answer
I am not sure that this is "multi-threading", but perhaps you meant to include a multi-core solution? If so, look at this earlier answer, "Performing calculations by subsets of data in R", which I found with a search for "[r] [data.table] parallel".
(It doubled the speed on a 4-core machine, although my system monitor suggests that only 2 cores were used during the mclapply call.) Code copied from this thread: http://r.789695.n4.nabble.com/Access-to-local-variables-in-quot-j-quot-expressions-tt2315330.html#a2315337
library(parallel)  # provides mclapply()
calc.fake.dt.mclapply <- function (dt) {
  # one forked worker task per critical age
  mclapply(6*c(1000,1:4,6,8,10),
           function(critical.age) {
             dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
             dt[, cumsum.lag(tmp), by = grp]$V1})
}
mk.fake.df <- function (n.groups=10000, n.per.group=70) {
  data.frame(grp=rep(1:n.groups, each=n.per.group),
             age=rep(0:(n.per.group-1), n.groups),
             x=rnorm(n.groups * n.per.group),
             ## These don't do anything, but only exist to give
             ## the table a similar size to the real data.
             y1=rnorm(n.groups * n.per.group),
             y2=rnorm(n.groups * n.per.group),
             y3=rnorm(n.groups * n.per.group),
             y4=rnorm(n.groups * n.per.group)) }
df <- mk.fake.df()
calc.fake.dt.lapply <- function (dt) { # use base lapply for testing
  lapply(6*c(1000,1:4,6,8,10),
         function(critical.age) {
           dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
           dt[, cumsum.lag(tmp), by = grp]$V1})
}
mk.fake.dt <- function (fake.df) {
  fake.dt <- as.data.table(fake.df)
  setkey(fake.dt, grp, age)
  fake.dt
}
library(data.table)  # must be loaded before as.data.table() and setkey() are called
dt <- mk.fake.dt(df)
cumsum.lag <- function (x) {
  x.prev <- c(0, x[-length(x)])
  cumsum(x.prev)
}
system.time(res.dt.mclapply <- calc.fake.dt.mclapply(dt))
user system elapsed
1.896 4.413 1.210
system.time(res.dt.lapply <- calc.fake.dt.lapply(dt))
user system elapsed
1.391 0.793 2.175
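The benchmark above parallelizes over critical ages rather than over groups. For the grouped call in the question, dtb[, f(x), by=id], one possible adaptation is to split the ids into chunks and give each chunk to its own forked worker. What follows is only a minimal sketch under stated assumptions, not code from the linked thread: f is a hypothetical stand-in, the table is shrunk so the sketch runs quickly, and n.cores and chunks are illustrative names. Note that mclapply defaults to mc.cores = getOption("mc.cores", 2L), which would be consistent with the two busy cores reported above, so the core count is passed explicitly here.

```r
library(data.table)
library(parallel)

# Smaller than the question's table so the sketch runs quickly.
dtb <- data.table(id = rep(1:100, 50), x = 1:5000)
setkey(dtb, id)

# Hypothetical stand-in for the "really complicated" f in the question.
f <- function(m) sum(sqrt(m))

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there. The value 2L is an arbitrary choice.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Assign every id to exactly one chunk so that no group is split across workers.
ids <- unique(dtb$id)
chunks <- split(ids, rep_len(seq_len(n.cores), length(ids)))

# Each worker runs the usual grouped query on its own subset of ids.
pieces <- mclapply(chunks, function(chunk.ids) {
  dtb[id %in% chunk.ids, f(x), by = id]
}, mc.cores = n.cores)

# Combine the per-chunk results and restore the ordering by id.
res.parallel <- rbindlist(pieces)
setkey(res.parallel, id)

# Single-process result for comparison.
res.serial <- dtb[, f(x), by = id]
```

Whether this helps depends on f. Forking the workers and combining their results has a fixed cost, so a cheap f may well run faster in the plain single-process call, while an expensive f is more likely to benefit. Timing both variants with system.time(), as in the benchmark above, is the safest way to decide.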