Split a data set, pass the subsets to a function in parallel, then recombine the results


Problem description

Here is what I am trying to do with the foreach package. I have a data set with 600 rows and 58,000 columns, with lots of missing values.

We need to impute the missing values with the missForest package, which does not run in parallel, so imputing the whole data set at once takes too much time. I am therefore thinking of dividing the data into 7 data sets (I have 7 cores), each with the same number of rows (my lines) and a different subset of the columns (markers), and then using %dopar% to pass the data sets to missForest in parallel.

I do not see how to divide the data into smaller data sets, pass them to missForest, and then recombine the outputs.

I would really appreciate it if you could show me how.

Here is a small example, from the BLR package, demonstrating my problem:

   library(BLR)
   library(missForest)
   data(wheat)              ## loads the marker matrix X
   X2 <- prodNA(X, 0.1)     ## set 10% of the entries to NA
   dim(X2)                  ## I need to divide X2 into 7 data frames (ii)

   X3 <- missForest(X2)

   X3$ximp                  ## combine the ii imputed data frames

Solution

When processing a large matrix in parallel, it can be very important to pass each cluster worker only as much data as it needs. This isn't an issue when using mclapply, either directly or indirectly through doParallel on Linux. But on Windows, input data is sent to the cluster workers via socket connections, so the amount of data sent matters a great deal.

For cases like this, I use the isplitCols function from the itertools package. It creates an iterator over blocks of columns of a matrix. Using the chunks argument, you can split the matrix so that each cluster worker gets exactly one submatrix.

Here's a translation of your example into foreach which uses isplitCols to split the input matrix into 7 submatrices, thus decreasing the data sent to each worker by a factor of seven compared to auto-exporting X2 to each worker:
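To make the column-splitting step concrete, here is a small sketch of what isplitCols produces on its own, outside of foreach. The toy matrix and the choice of 4 chunks are illustrative assumptions, not part of the original example:

```r
library(itertools)   # also attaches iterators, which supplies as.list() for iterators

# Toy 3 x 10 matrix; the values are arbitrary
m <- matrix(1:30, nrow = 3)

# Iterator over 4 blocks of columns
it <- isplitCols(m, chunks = 4)

# Drain the iterator into a plain list of submatrices
blocks <- as.list(it)

# Every block keeps all rows; the column counts add up to ncol(m)
sapply(blocks, nrow)
sapply(blocks, ncol)

# Binding the blocks back together in order should restore the original matrix
restored <- do.call(cbind, blocks)
identical(restored, m)
```

Because the blocks come out in column order, a .combine='cbind' in foreach puts them back in their original positions.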

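library(doParallel)
library(itertools)
library(BLR)
library(missForest)
ncores <- 7
# Start a socket cluster and register it as the %dopar% backend
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
data(wheat)
X2 <- prodNA(X, 0.1)
# Each worker receives one block of columns, imputes it, and returns it;
# cbind reassembles the imputed blocks in column order
X3 <- foreach(m=isplitCols(X2, chunks=ncores), .combine='cbind',
              .packages='missForest') %dopar% {
  missForest(m)$ximp
}
print(X3)
stopCluster(cl)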
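For comparison, the same split, apply, and recombine pattern can be sketched with only the base parallel package. This is not the approach from the answer above, just an alternative illustration: impute_block is a hypothetical stand-in for missForest (a simple column-mean fill), and the toy matrix and 2 workers are assumptions chosen so the sketch runs quickly:

```r
library(parallel)

# Hypothetical stand-in for missForest: fill each column's NAs with that column's mean
impute_block <- function(block) {
  for (j in seq_len(ncol(block))) {
    missing <- is.na(block[, j])
    block[missing, j] <- mean(block[, j], na.rm = TRUE)
  }
  block
}

# Toy 10 x 20 matrix with 20 entries set to NA
set.seed(1)
X2 <- matrix(rnorm(200), nrow = 10)
X2[sample(length(X2), 20)] <- NA

ncores <- 2
cl <- makePSOCKcluster(ncores)

# Contiguous column indices, one group per worker
idx <- splitIndices(ncol(X2), ncores)
blocks <- lapply(idx, function(cols) X2[, cols, drop = FALSE])

# Each worker receives and imputes only its own block of columns
imputed <- parLapply(cl, blocks, impute_block)
stopCluster(cl)

# Recombine the blocks in their original column order
X3 <- do.call(cbind, imputed)
dim(X3)
```

Note that with either approach each block is imputed independently, so the imputation for a column can only draw on the other columns in the same block, and the result will generally differ from imputing the whole matrix at once.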
