Subset of data table when computing in parallel
Problem description
I'm trying to run a parallel computation with data.table. I have a big data set, and I'd like to work with every group of subjects independently and in parallel.
Let: DataP is a big data set: ID, x1, x2, x3, group
My code is:
# Data preparations
# I split an index (indx) because data split takes a lot of time with my data.
setkey(DataP, SplitKey_f)
indx <- split(seq(nrow(DataP)), DataP$group)
l <- length(unique(DataP$group))
library(parallel)
library(doParallel)
library(foreach)
cl <- makeCluster(8)
registerDoParallel(cl)
foreach(i = 1:l, .combine = rbind) %dopar% {
  library(data.table)
  Psubset <- DataP[, indx[[i]]]
# some transformations on the data
}
stopCluster(cl)
The above doesn't work because foreach with parallel computing cannot execute the line:
Psubset <- DataP[, indx[[i]]]
However, %do% instead of %dopar% works fine (but takes a long time).
How can I fix the problem - fast subsetting of a data.table within a parallel loop?
The comment from @Roland above notwithstanding, I've actually found that this type of approach can be very effective, and I've used it to parallelize millions of computations over millions of rows on a 40-core EC2 instance.
The first thing I would do is ensure that your key is set to the column you'll be using as an index for subsetting. Your index is a list of integer vectors, which is a little different from how I usually do it, but it should still work.
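As a minimal sketch of that setup (the column names and toy data here are assumptions standing in for the question's DataP):

```r
library(data.table)

# Toy stand-in for DataP; the real columns and groups are assumptions
DataP <- data.table(ID = 1:6, x1 = c(10, 20, 30, 40, 50, 60),
                    group = rep(c("a", "b"), each = 3))

# Key on the grouping column that drives the subsetting
setkey(DataP, group)

# List of row-index vectors, one per group (the question's indx)
indx <- split(seq_len(nrow(DataP)), DataP$group)
```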
Try the following:
out <-
  foreach(i = indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    Psubset <- DataP[i, ]
    # do some operations on Psubset
  }
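The loop body is just row subsetting plus a per-group transformation; here is a serial sketch of the same logic, with lapply standing in for foreach and a made-up group mean as the "operations" (the data and column names are assumptions):

```r
library(data.table)

DataP <- data.table(ID = 1:6, x1 = c(1, 2, 3, 4, 5, 6),
                    group = rep(c("a", "b"), each = 3))
indx <- split(seq_len(nrow(DataP)), DataP$group)

# Same body as the foreach version, run serially for illustration
out <- rbindlist(lapply(indx, function(i) {
  Psubset <- DataP[i, ]
  Psubset[, .(group = group[1], mean_x1 = mean(x1))]  # placeholder operation
}))
```

Swapping the lapply for the foreach/%dopar% loop recovers the parallel version.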
Or, if for some reason .combine isn't working or needs additional arguments, you can combine after the fact:
out_list <-
  foreach(i = indx, .packages = c('data.table')) %dopar% {
    Psubset <- DataP[i, ]
    # do some operations on Psubset
  }
out <- rbindlist(out_list)  # fill=TRUE, etc.
If that doesn't work, I would rework the index so that subsetting works more like:
out <-
  foreach(i = 1:max_indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    Psubset <- DataP[indx == i, ]
    # do some operations on Psubset
  }
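With the key set on the grouping column, that indx == i vector scan can also be replaced by a keyed (binary-search) subset, sketched here with made-up data:

```r
library(data.table)

DataP <- data.table(ID = 1:6, x1 = 1:6, group = rep(c("a", "b"), each = 3))
setkey(DataP, group)

# Keyed subset: binary search on the key rather than a full vector scan
Psubset <- DataP[.("b")]
```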
But without a reproducible example it's hard to know which one would work best.