Running TensorFlow on a Slurm Cluster?


Question

I have access to a computing cluster, specifically one node with two 12-core CPUs, which runs the Slurm Workload Manager.

I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this, or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job, and cannot directly execute python/tensorflow via ssh.

Does anyone have an idea, a tutorial, or any kind of source on this topic?

Answer

It can be relatively simple.

Under the simplifying assumption that you request one process per host, Slurm will provide you with all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.

For example, you can initialize your task index, the number of tasks and the node list as follows:

from hostlist import expand_hostlist
import os

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
tf_hostlist = [ ("%s:22222" % host) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]
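When running outside a Slurm allocation (e.g. for a quick local sanity check), these environment variables are absent. A small variant with fallbacks can help; this is a sketch, and the single-task defaults (`0`, `1`, `"localhost"`) are my assumptions for local runs, not part of the original answer:

```python
import os

def slurm_task_info(environ=None):
    """Read Slurm task info from the environment, falling back to
    single-task defaults so the script also runs outside an allocation.
    The defaults are illustrative, not mandated by Slurm."""
    env = os.environ if environ is None else environ
    task_index = int(env.get("SLURM_PROCID", "0"))
    n_tasks    = int(env.get("SLURM_NPROCS", "1"))
    nodelist   = env.get("SLURM_NODELIST", "localhost")
    return task_index, n_tasks, nodelist
```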

Note that Slurm gives you the host list in its compressed format (e.g. "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available at https://pypi.python.org/pypi/python-hostlist
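For illustration, the expansion that module performs can be sketched in pure Python. This is a simplified stand-in, not the actual python-hostlist implementation: the helper name `expand_simple_hostlist` is mine, and it handles only a single bracketed numeric range, whereas the real module supports far more of the Slurm host-list syntax:

```python
import re

def expand_simple_hostlist(hostlist):
    """Expand a compressed Slurm host list like "myhost[11-13]" into
    ["myhost11", "myhost12", "myhost13"]. Simplified sketch: handles one
    "prefix[lo-hi]" range; plain names pass through unchanged."""
    m = re.match(r"^(.*)\[(\d+)-(\d+)\]$", hostlist)
    if not m:
        return [hostlist]
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # preserve zero-padding, e.g. node[01-03] -> node01...
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

print(expand_simple_hostlist("myhost[11-13]"))
# ['myhost11', 'myhost12', 'myhost13']
```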

At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:

cluster = tf.train.ClusterSpec( {"your_taskname" : tf_hostlist } )
server  = tf.train.Server( cluster.as_cluster_def(),
                           job_name   = "your_taskname",
                           task_index = task_index )

And you're set! You can now place TensorFlow nodes on a specific host of your allocation with the usual syntax:

for idx in range(n_tasks):
   with tf.device("/job:your_taskname/task:%d" % idx ):
       ...

A flaw in the code above is that all your jobs will instruct TensorFlow to start servers listening on the fixed port 22222. If multiple such jobs happen to be scheduled to the same node, the second one will fail to listen on 22222.

A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that it lets you request ports with the --resv-ports option. In practice, this means asking them to add a line like the following to their slurm.conf:

MpiParams=ports=15000-19999

Before you bug your Slurm admin, check which options are already configured, e.g. with:

scontrol show config | grep MpiParams

If your site already uses an old version of OpenMPI, chances are an option like this is already in place.

Then, amend the first snippet of code as follows:

from hostlist import expand_hostlist
import os

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
port        = int( os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0] )
tf_hostlist = [ ("%s:%s" % (host, port)) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]
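To tie this together, a submission script could look roughly like the following sketch. The script name `train.py`, node count, and CPU count are placeholders you would adapt to your site; --resv-ports only works once the MpiParams line above is configured:

```shell
#!/bin/bash
#SBATCH --job-name=tf-cluster
#SBATCH --nodes=2              # one TensorFlow task per node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12

# srun starts one copy of the script per task; each copy reads
# SLURM_PROCID, SLURM_NPROCS, SLURM_NODELIST and (with --resv-ports)
# SLURM_STEP_RESV_PORTS from its environment, as in the snippets above.
srun --resv-ports python train.py
```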

Good luck!

