How to set up doSNOW and SOCK cluster with Torque/MOAB scheduler?


Question


In continuation of this question (https://stackoverflow.com/questions/17222942/allow-foreach-workers-to-register-and-distribute-sub-tasks-to-other-workers), what is a best practice to connect doSNOW and SOCK cluster to Torque/MOAB scheduler in order to avoid processor affinity in an inner parallel loop that handles some part of the code of an outer parallel loop?

From Steve's answer to that question, the baseline code, without any interaction with the scheduler, could be:

library(doSNOW)
hosts <- c('host-1', 'host-2')
cl <- makeSOCKcluster(hosts)
registerDoSNOW(cl)
r <- foreach(i=1:4, .packages='doMC') %dopar% {
  registerDoMC(2)
  foreach(j=1:8, .combine='c') %dopar% {
    i * j
  }
}
stopCluster(cl)  

Solution

Torque always creates a file containing the node names that have been allocated to your job by Moab, and it passes the path of that file to your job via the PBS_NODEFILE environment variable. Node names may be listed multiple times to indicate that it allocated multiple cores to your job on that node. In this case, we want to start a cluster worker for each unique node name in PBS_NODEFILE, but keep track of the number of allocated cores on each of those nodes so we can specify the correct number of cores when registering doMC.
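To make that file format concrete, here is a hypothetical PBS_NODEFILE for a job granted 2 cores on node01 and 3 cores on node02 (the node names are invented for illustration), together with a shell one-liner that recovers the per-node core counts the same way the R function below does:

```shell
# Hypothetical PBS_NODEFILE: one line per allocated core,
# so a repeated name means multiple cores on that node.
printf 'node01\nnode01\nnode02\nnode02\nnode02\n' > nodefile

# Counting repeats per unique name recovers the allocation:
sort nodefile | uniq -c | awk '{print $1, $2}'
# 2 node01
# 3 node02
```

In a real job, Torque writes this file itself; the snippet only simulates its contents.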

Here is a function that reads PBS_NODEFILE and returns a data frame with the allocated node information:

getnodes <- function() {
  f <- Sys.getenv('PBS_NODEFILE')
  x <- if (nzchar(f)) readLines(f) else rep('localhost', 3)
  as.data.frame(table(x), stringsAsFactors=FALSE)
}

The returned data frame contains a column named "x" of node names and a column named "Freq" of corresponding core counts.
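For illustration, tallying a node-name vector directly (rather than reading a real PBS_NODEFILE) shows the shape of the returned data frame; the host names and counts here are invented:

```r
# Simulate the lines of a PBS_NODEFILE: 2 cores on host-1, 3 on host-2.
x <- c('host-1', 'host-1', 'host-2', 'host-2', 'host-2')

# Same tally as getnodes(): one row per unique node, Freq = core count.
nodes <- as.data.frame(table(x), stringsAsFactors = FALSE)
print(nodes)
#        x Freq
# 1 host-1    2
# 2 host-2    3
```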

This makes it simple to create and register a SOCK cluster with one worker per unique node:

nodes <- getnodes()
cl <- makeSOCKcluster(nodes$x)
registerDoSNOW(cl)

We can now easily execute a foreach loop with one task per worker, but it's not so easy to pass the correct number of allocated cores to each of those workers without depending on some implementation details of both snow and doSNOW, specifically relating to the implementation of the clusterApplyLB function used by doSNOW. Of course, it's easy if you happen to know that the number of allocated cores is the same on each node, but it's harder if you want a general solution to the problem.

One (not so elegant) general solution is to assign the number of allocated cores to a global variable on each of the workers via the snow clusterApply function:

setcores <- function(cl, nodes) {
  f <- function(cores) assign('allocated.cores', cores, pos=.GlobalEnv)
  clusterApply(cl, nodes$Freq, f)
}
setcores(cl, nodes)

This guarantees that the value of the "allocated.cores" variable on each worker equals the number of times its node appears in PBS_NODEFILE.

Now we can use that global variable when registering doMC:

r <- foreach(i=seq_along(nodes$x), .packages='doMC') %dopar% {
  registerDoMC(allocated.cores)
  foreach(j=1:allocated.cores, .combine='c') %dopar% {
    i * j
  }
}
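As a sanity check, the nested loops just compute i * j over the allocated cores, so with four nodes of 8 cores each (the allocation requested in the example job script) the result has a simple serial equivalent:

```r
# Serial equivalent of the nested foreach loops, assuming 4 workers
# that each received allocated.cores == 8.
r <- lapply(1:4, function(i) i * (1:8))
r[[3]]
# [1]  3  6  9 12 15 18 21 24
```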

Here is an example job script that could be used to execute this R script:

#!/bin/sh
#PBS -l nodes=4:ppn=8
cd "$PBS_O_WORKDIR"
R --slave -f hybridSOCK.R

When this is submitted via the qsub command, the R script will create a SOCK cluster with four workers, and each of those workers will execute the inner foreach loop using 8 cores. But since the R code is general, it should do the right thing regardless of the resources requested via qsub.
