How to set up doSNOW and SOCK cluster with Torque/MOAB scheduler?


Question

In continuation of this question (https://stackoverflow.com/questions/17222942/allow-foreach-workers-to-register-and-distribute-sub-tasks-to-other-workers), what is the best practice for connecting a doSNOW and SOCK cluster to the Torque/MOAB scheduler, in order to avoid processor affinity in an inner parallel loop that handles part of the code of an outer parallel loop?

From Steve's answer to that question, the baseline code without interaction with the scheduler could be:

library(doSNOW)

# Hosts are listed manually here; no interaction with the scheduler yet
hosts <- c('host-1', 'host-2')
cl <- makeSOCKcluster(hosts)
registerDoSNOW(cl)

# Outer loop runs on the SOCK workers; each worker registers doMC
# and runs the inner loop on two local cores
r <- foreach(i=1:4, .packages='doMC') %dopar% {
  registerDoMC(2)
  foreach(j=1:8, .combine='c') %dopar% {
    i * j
  }
}
stopCluster(cl)

Answer

Torque always creates a file containing the node names that have been allocated to your job by Moab, and it passes the path of that file to your job via the PBS_NODEFILE environment variable. A node name may be listed multiple times to indicate that multiple cores were allocated to your job on that node. In this case, we want to start one cluster worker for each unique node name in PBS_NODEFILE, but keep track of the number of allocated cores on each of those nodes so we can specify the correct number of cores when registering doMC.
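
For example (the host names here are hypothetical), a job submitted with nodes=2:ppn=4 might get a PBS_NODEFILE that looks like this:

host-1
host-1
host-1
host-1
host-2
host-2
host-2
host-2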

Here is a function that reads PBS_NODEFILE and returns a data frame with the allocated node information:

getnodes <- function() {
  f <- Sys.getenv('PBS_NODEFILE')
  # Fall back to three 'localhost' entries when running outside Torque
  x <- if (nzchar(f)) readLines(f) else rep('localhost', 3)
  # table() counts how many times each node name appears,
  # i.e. how many cores were allocated on that node
  as.data.frame(table(x), stringsAsFactors=FALSE)
}

The returned data frame contains a column named "x" of node names and a column named "Freq" of the corresponding core counts.
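
For the hypothetical node file shown earlier, getnodes() would return a data frame that prints roughly like this:

       x Freq
1 host-1    4
2 host-2    4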

This makes it simple to create and register a SOCK cluster with one worker per unique node:

nodes <- getnodes()
cl <- makeSOCKcluster(nodes$x)
registerDoSNOW(cl)

We can now easily execute a foreach loop with one task per worker, but it isn't so easy to pass the correct number of allocated cores to each of those workers without depending on implementation details of both snow and doSNOW, specifically the implementation of the clusterApplyLB function used by doSNOW. Of course, it's easy if you happen to know that the number of allocated cores is the same on every node, but it's harder if you want a general solution to the problem.

One (not so elegant) general solution is to assign the number of allocated cores to a global variable on each of the workers via the snow clusterApply function:

setcores <- function(cl, nodes) {
  # Send each worker its own core count and store it in a global variable
  f <- function(cores) assign('allocated.cores', cores, pos=.GlobalEnv)
  clusterApply(cl, nodes$Freq, f)
}
setcores(cl, nodes)

This guarantees that the value of the "allocated.cores" variable on each worker is equal to the number of times that node appeared in PBS_NODEFILE.
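
As a quick sanity check (this snippet is not part of the original answer), snow's clusterEvalQ can read the variable back from every worker; the result should match nodes$Freq:

# Returns a list with one element per worker, each holding that
# worker's allocated.cores value
clusterEvalQ(cl, allocated.cores)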

Now we can use that global variable when registering doMC:

r <- foreach(i=seq_along(nodes$x), .packages='doMC') %dopar% {
  registerDoMC(allocated.cores)
  foreach(j=1:allocated.cores, .combine='c') %dopar% {
    i * j
  }
}

Here is an example job script that could be used to execute this R script:

#!/bin/sh
#PBS -l nodes=4:ppn=8
cd "$PBS_O_WORKDIR"
R --slave -f hybridSOCK.R

When this is submitted via the qsub command, the R script will create a SOCK cluster with four workers, and each of those workers will execute the inner foreach loop using eight cores. But since the R code is general, it should do the right thing regardless of the resources requested via qsub.
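
For reference, a complete hybridSOCK.R assembled from the fragments above might look like the following sketch (the original answer doesn't show the full file, so treat this as one plausible arrangement):

library(doSNOW)

# Read PBS_NODEFILE and count allocated cores per node
getnodes <- function() {
  f <- Sys.getenv('PBS_NODEFILE')
  x <- if (nzchar(f)) readLines(f) else rep('localhost', 3)
  as.data.frame(table(x), stringsAsFactors=FALSE)
}

# Store each worker's core count in a global variable on that worker
setcores <- function(cl, nodes) {
  f <- function(cores) assign('allocated.cores', cores, pos=.GlobalEnv)
  clusterApply(cl, nodes$Freq, f)
}

nodes <- getnodes()
cl <- makeSOCKcluster(nodes$x)   # one worker per unique node
registerDoSNOW(cl)
setcores(cl, nodes)

r <- foreach(i=seq_along(nodes$x), .packages='doMC') %dopar% {
  registerDoMC(allocated.cores)  # use all cores allocated on this node
  foreach(j=1:allocated.cores, .combine='c') %dopar% {
    i * j
  }
}

stopCluster(cl)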

