Subset of data table when computing in parallel

Views: 110 | Published: 2017/3/12 13:05:48 | Tags: r, foreach, split, data.table


Problem description

 
I'm trying to run a parallel computation with data.table. I have a big data set, and I'd like to work with every group of subjects independently and in parallel. 

Let DataP be a big data set with the columns ID, x1, x2, x3, group.

My code is:

# Data preparation
# I split an index (indx) because splitting the data itself takes a lot of time with my data.
setkey(DataP, SplitKey_f)
indx <- split(seq(nrow(DataP)), DataP$group)
l <- length(unique(DataP$group))

library(parallel)
library(doParallel)
library(foreach)
cl <- makeCluster(8)
registerDoParallel(cl)

foreach(i = 1:l, .combine = rbind) %dopar% {
  library(data.table)
  Psubset <- DataP[, indx[[i]]]
  # some transformations on the data
}
stopCluster(cl)

The above doesn't work because foreach with parallel computing cannot execute the line:

Psubset <- DataP[, indx[[i]]]

However, using %do% instead of %dopar% works fine (but takes a lot of time).

How can I fix the problem, that is, quickly subset a data.table within a parallel loop?
Solution
The comment from @Roland above notwithstanding, I've actually found that this type of approach can be very effective, and I've used it to parallelize millions of computations over millions of rows on a 40-core EC2 instance.

The first thing I would do is ensure that your key is set to the column you'll be using as an index for subsetting. Your index is a list of integer vectors, which is a little different from how I usually do it, but it should still work.
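
As a minimal sketch of what setting the key looks like, assuming a small toy table in place of DataP (the values below are made up for illustration), setkey sorts the table by the chosen column and allows keyed lookups with .():

```r
library(data.table)

# Toy stand-in for DataP (hypothetical values)
DataP <- data.table(
  ID    = 1:6,
  x1    = c(0.5, 1.2, 3.1, 0.7, 2.2, 1.9),
  group = c("b", "a", "c", "a", "b", "c")
)

# Sort the table by group and mark that column as the key
setkey(DataP, group)
key(DataP)   # "group"

# Keyed lookup: all rows whose key column equals "a"
sub_a <- DataP[.("a")]
```

Note that setkey reorders the rows, so any row-number index should be built after the key is set.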

Try the following:
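out <-
  foreach(i = indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    # plain rbind is used because data.table does not export an rbind function;
    # with data.table loaded, rbind dispatches to its method
    Psubset <- DataP[i, ]
    # do some operations on Psubset
    Psubset  # the last value of the block is what foreach combines
  }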
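
For reference, here is a self-contained sketch of this pattern on toy data. The table contents, the choice of 2 workers, and the per-group summary are assumptions made for illustration, not part of the original question. It passes plain rbind as .combine, which dispatches to data.table's method once the package is loaded.

```r
library(data.table)
library(foreach)
library(doParallel)

# Toy stand-in for DataP (hypothetical values)
DataP <- data.table(
  ID    = 1:12,
  x1    = as.numeric(1:12),
  group = rep(c("a", "b", "c"), times = 4)
)
setkey(DataP, group)

# One integer vector of row numbers per group, built after setkey
# because setkey reorders the rows
indx <- split(seq_len(nrow(DataP)), DataP$group)

cl <- makeCluster(2)   # 2 workers is an arbitrary choice for this sketch
registerDoParallel(cl)

out <- foreach(i = indx, .packages = "data.table", .combine = rbind) %dopar% {
  Psubset <- DataP[i, ]   # rows of one group (i goes in the row position)
  # illustrative per-group operation: one summary row per group
  Psubset[, .(group = group[1], n = .N, x1_sum = sum(x1))]
}

stopCluster(cl)
```

Iterating over indx directly hands each worker task its own vector of row numbers, so no counter or indx[[i]] lookup is needed inside the loop.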
Or, if for some reason .combine isn't working or needs additional arguments, you can combine the results after the fact:
out_list <-
  foreach(i = indx, .packages = c('data.table')) %dopar% {
    Psubset <- DataP[i, ]
    # do some operations on Psubset
    Psubset  # value returned for this iteration
  }
out <- rbindlist(out_list) #, fill=TRUE, etc.
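
To illustrate the fill argument mentioned in the comment, here is a small sketch that assumes two made-up per-group results with different columns. rbindlist with fill = TRUE matches columns by name and pads the missing ones with NA.

```r
library(data.table)

# Hypothetical per-group results whose columns differ
out_list <- list(
  data.table(group = "a", n = 4L),
  data.table(group = "b", n = 4L, extra = 1.5)
)

# fill = TRUE pads columns missing from some list elements with NA
out <- rbindlist(out_list, fill = TRUE)
```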
If that doesn't work I would take a look at the index so that it works more like:
out <-
  foreach(i = 1:max_indx, .packages = c('data.table'), .combine = rbind) %dopar% {
    # assumes indx is now a vector of integer group ids, one per row,
    # and max_indx is its largest value
    Psubset <- DataP[indx == i, ]
    # do some operations on Psubset
  }
But without a reproducible example, it's hard to know which one would work best.
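
One hedged way to compare the variants is to check, on a small toy table, that they select the same rows before timing them on the real data. The sketch below assumes made-up data and uses sequential lapply rather than a cluster, so it only tests the subsetting logic.

```r
library(data.table)

# Toy stand-in for DataP with integer group ids (hypothetical values)
DataP <- data.table(
  ID    = 1:8,
  x1    = c(2, 4, 6, 8, 10, 12, 14, 16),
  group = rep(c(1L, 2L), each = 4)
)
setkey(DataP, group)
indx <- split(seq_len(nrow(DataP)), DataP$group)

# Variant 1: subset by precomputed row numbers
by_rows <- lapply(indx, function(i) DataP[i, ])

# Variant 2: subset by comparing the group column to a group id
by_group <- lapply(sort(unique(DataP$group)), function(g) DataP[group == g])

# TRUE when both variants pick the same rows for every group
same <- all(mapply(function(a, b) identical(a$ID, b$ID), by_rows, by_group))
```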
