Parallel processing in R with H2O

This article describes how to do parallel processing in R with H2O. It should be a useful reference for anyone trying to solve the same problem; interested readers can follow along below.

Problem description

I am setting up a piece of code to process some computations for N groups in my data in parallel, using foreach.

I have a computation that involves a call to h2o.gbm.

In my current, sequential set-up, I use up to about 70% of my RAM.

How do I correctly set up h2o.init() within the parallel piece of code? I am afraid that I might run out of RAM when I use multiple cores.

My Windows 10 machine has 12 cores and 128GB of RAM.

Would something like this pseudo-code work?

    library(foreach)
    library(doParallel)

    #setup parallel backend to use 12 processors
    cl <- makeCluster(12)
    registerDoParallel(cl)

    #loop
    df4 <- foreach(i = as.numeric(seq(1,999)), .combine=rbind) %dopar% {
      df4 <- data.frame()
      #bunch of computations
      h2o.init(nthreads=1, max_mem_size="10G")
      gbm <- h2o.gbm(train_some_model)
      df4 <- data.frame(someoutput)
    }

    fwrite(df4, append=TRUE)

    stopCluster(cl)


Solution

The way your code is currently set up won't be the best option. I understand what you are trying to do -- execute a bunch of GBMs in parallel (each on a single-core H2O cluster), so you can maximize the CPU usage across the 12 cores on your machine. However, what your code will actually do is try to run all the GBMs in your foreach loop in parallel on the same single-core H2O cluster: you can only connect to one H2O cluster at a time from a single R instance, and the foreach loop will create a new R instance (see option #4 below).

Unlike most machine learning algos in R, the H2O algos are all multi-core enabled, so the training process will already be parallelized at the algorithm level, without the need for a parallel R package like foreach.
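
To make that concrete: the usual pattern is a single local cluster started once with all cores, which every subsequent h2o call in the same R session talks to. A minimal sketch, assuming a local H2O install (the 64G memory cap is an arbitrary choice for a 128GB machine, not a value from this answer):

    library(h2o)

    # one local H2O cluster, started once, using all available cores
    h2o.init(nthreads = -1, max_mem_size = "64G")

    # print the cluster configuration (number of cores, allowed memory)
    h2o.clusterInfo()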

You have a few options (#1 or #3 is probably best):


1. Set h2o.init(nthreads = -1) at the top of your script to use all 12 of your cores. Change the foreach() loop to a regular loop and train each GBM (on a different data partition) sequentially. Although the different GBMs are trained sequentially, each single GBM will be fully parallelized across the H2O cluster. (A minimal sketch of this option follows right after this list.)

2. Set h2o.init(nthreads = -1) at the top of your script, but keep your foreach() loop. This should run all your GBMs at once, with each GBM parallelized across all cores. This could overwhelm the H2O cluster a bit (this is not really how H2O is meant to be used) and could be a bit slower than #1, but it's hard to say without knowing the size of your data and the number of partitions you want to train on. If you are already using 70% of your RAM for a single GBM, then this might not be the best option.

3. You can update your code to do the following (which most closely resembles your original script). This will preserve your foreach loop, creating a new 1-core H2O cluster at a different port on your machine. See below.
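
For option #1, a minimal sketch might look like the following; it reuses the iris example from the option #3 code below, so the only point here is the loop structure and the single shared cluster (the 64G memory cap is again an assumed value):

    library(h2o)

    # start one cluster using all 12 cores, once, at the top of the script
    h2o.init(nthreads = -1, max_mem_size = "64G")

    data(iris)
    hf <- as.h2o(iris)

    # regular sequential loop instead of foreach: the GBMs are trained one
    # after another, but each GBM uses the whole multi-core cluster
    df4 <- data.frame()
    for (i in seq(20)) {
      ss   <- h2o.splitFrame(hf)
      gbm  <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
      pred <- as.data.frame(h2o.predict(gbm, ss[[2]]))[, 1, drop = FALSE]
      df4  <- rbind(df4, pred)
    }

In a real script, the repeated iris split would be replaced by your N data partitions.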


For option #3, here is an updated R code example which uses the iris dataset and returns the predicted class for iris as a data.frame:


    library(foreach)
    library(doParallel)
    library(h2o)
    h2o.shutdown(prompt = FALSE)

    #setup parallel backend to use 12 processors
    cl <- makeCluster(12)
    registerDoParallel(cl)

    #loop
    df4 <- foreach(i = seq(20), .combine=rbind) %dopar% {
      library(h2o)
      port <- 54321 + 3*i
      print(paste0("http://localhost:", port))
      h2o.init(nthreads = 1, max_mem_size = "1G", port = port)
      df4 <- data.frame()
      data(iris)
      data <- as.h2o(iris)
      ss <- h2o.splitFrame(data)
      gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
      df4 <- as.data.frame(h2o.predict(gbm, ss[[2]]))[,1]
    }
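
One detail the example above leaves out (the question's own script includes it): once the loop finishes, you would still stop the parallel backend. Note that the per-worker H2O clusters are separate Java processes, so they may keep running until shut down explicitly from a session connected to the matching port:

    # stop the doParallel backend once all workers are done
    stopCluster(cl)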

In order to judge which option is best, I would try running this on a few data partitions (maybe 10-100) to see which approach seems to scale best. If your training data is small, it's possible that #3 will be faster than #1, but overall, I'd say #1 is probably the most scalable/stable solution.
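
If you want to put numbers on that comparison, one rough approach is to time each variant on the same small number of partitions. run_option_1() and run_option_3() here are hypothetical wrappers around the two snippets above, not functions from this answer:

    # hypothetical wrappers around the option #1 and option #3 snippets
    n  <- 20   # number of data partitions to test on
    t1 <- system.time(run_option_1(n))[["elapsed"]]
    t3 <- system.time(run_option_3(n))[["elapsed"]]
    cat(sprintf("option #1: %.1fs, option #3: %.1fs (%d partitions)\n", t1, t3, n))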



This concludes this article on parallel processing in R with H2O. We hope the answer recommended above is helpful, and we hope you will continue to support IT屋!
