Subset of data table when computing in parallel


Problem description


I'm trying to run a parallel computation with data.table. I have a big data set, and I'd like to work with every group of subjects independently and in parallel.

Let DataP be a large data set with the columns ID, x1, x2, x3, group.

My code is:

# Data preparations
# I split an index (indx) because data split takes a lot of time with my data.
setkey(DataP, SplitKey_f)
indx <- split(seq(nrow(DataP)), DataP$group)
l <- length(unique(DataP$group))

library(parallel)
library(doParallel)
library(foreach)
cl <- makeCluster(8)
registerDoParallel(cl)

foreach(i = 1:l, .combine = rbind) %dopar% {
  library(data.table)
  Psubset <- DataP[, indx[[i]]]
  # some transformations on the data
}
stopCluster(cl)

The above doesn't work: inside the parallel foreach loop, the following line cannot be executed:

Psubset <- DataP[, indx[[i]]]

However, using %do% instead of %dopar% works fine, although it takes a long time.

How can I fix this, that is, quickly subset a data.table within a parallel loop?

Solution

The comment from @Roland above notwithstanding, I've actually found that this type of approach can be very effective, and I've used it to parallelize millions of computations over millions of rows on a 40-core EC2 instance.

The first thing I would do is ensure that your key is set to the column you'll be using as an index for subsetting. Your index is a list of arrays, which is a little different from how I usually do it, but it should still work.
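As a minimal sketch of what setting the key for subsetting can look like: the toy table below only stands in for DataP, and its values and the group label "b" are assumptions made for illustration, not taken from the question.

```r
library(data.table)

# Hypothetical toy data standing in for DataP (values are made up)
DataP <- data.table(
  ID    = 1:6,
  x1    = c(0.5, 1.2, 3.1, 4.0, 2.2, 0.9),
  group = c("a", "b", "a", "c", "b", "c")
)

# Key on the column used for subsetting; this also sorts DataP by group
setkey(DataP, group)

# Keyed subset: all rows whose group is "b"
Psubset <- DataP[.("b")]
```

With the key in place, data.table can locate the rows of a group by binary search rather than scanning the whole column.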

Try the following:

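out <- 
  # plain rbind is used here; it dispatches to data.table's rbind method,
  # and data.table does not export a function named rbind
  foreach(i = indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    Psubset <- DataP[i, ]
    # do some operations on Psubset
  }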

Or, if for some reason .combine isn't working or needs additional arguments, you can combine the results after the fact.

out_list <- 
  foreach(i = indx, .packages = c('data.table')) %dopar% {
    Psubset <- DataP[i, ]
    # do some operations on Psubset
  }
out <- rbindlist(out_list) #, fill=TRUE, etc.
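For reference, here is a self-contained sketch of the collect-then-rbindlist pattern. The toy data, the two-worker cluster, and the per-group summary are all assumptions chosen for illustration; the real transformation would go where the summary is computed.

```r
library(data.table)
library(foreach)
library(doParallel)

# Hypothetical toy data standing in for DataP
DataP <- data.table(
  ID    = 1:8,
  x1    = as.numeric(1:8),
  group = rep(c("a", "b", "c", "d"), each = 2)
)

# Row indices per group, built the same way as in the question
indx <- split(seq_len(nrow(DataP)), DataP$group)

cl <- makeCluster(2)  # small worker count, chosen only for this sketch
registerDoParallel(cl)

# Each iteration receives one integer vector of row numbers
out_list <- foreach(i = indx, .packages = "data.table") %dopar% {
  Psubset <- DataP[i, ]
  # placeholder transformation: one summary row per group
  Psubset[, .(group = group[1], n = .N, x1_sum = sum(x1))]
}

stopCluster(cl)

# Combine the per-group results on the master process
out <- rbindlist(out_list)
```

Note that the row indices go in the first (row) position of the brackets, DataP[i, ], not after the comma.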

If that doesn't work, I would rework the index so that the loop looks more like:

# here indx is assumed to be a vector of integer group ids (one per row of DataP),
# and max_indx its largest value
out <- 
  foreach(i = 1:max_indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    Psubset <- DataP[indx == i, ]
    # do some operations on Psubset
  }
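One possible way to build such an integer index is sketched below. The column name gid, the toy data, and the per-group mean are hypothetical; the sketch stores the group id as a column of DataP rather than as a separate vector.

```r
library(data.table)
library(foreach)
library(doParallel)

# Hypothetical toy data standing in for DataP
DataP <- data.table(
  ID    = 1:6,
  x1    = c(2, 4, 6, 8, 10, 12),
  group = c("a", "a", "b", "b", "c", "c")
)

# Hypothetical integer group id column running from 1 to max_indx
DataP[, gid := .GRP, by = group]
max_indx <- max(DataP$gid)

cl <- makeCluster(2)  # small worker count, chosen only for this sketch
registerDoParallel(cl)

out <- foreach(i = seq_len(max_indx), .packages = "data.table",
               .combine = rbind) %dopar% {
  Psubset <- DataP[gid == i, ]
  # placeholder transformation: mean of x1 within the group
  Psubset[, .(gid = gid[1], x1_mean = mean(x1))]
}

stopCluster(cl)
```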

But without a reproducible example it's hard to know which one would work best.
