Distributed tensorflow monopolizes GPUs after running server.__init__
Problem description
I have two computers with two GPUs each. I am trying to get started with distributed tensorflow and am very confused about how it all works. On computer A I would like to have one ps task (I have the impression this should go on the CPU) and two worker tasks (one per GPU). And I would like to have two worker tasks on computer B. Here is how I have tried to implement this, in test.py:
import tensorflow as tf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job_name', required = True, type = str)
parser.add_argument('--task_idx', required = True, type = int)
args, _ = parser.parse_known_args()

JOB_NAME = args.job_name
TASK_INDEX = args.task_idx

ps_hosts = ["computerB-i9:2222"]
worker_hosts = ["computerA-i7:2222", "computerA-i7:2223", "computerB-i9:2223", "computerB-i9:2224"]

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name = JOB_NAME, task_index = TASK_INDEX)

if JOB_NAME == "ps":
    server.join()
elif JOB_NAME == "worker":
    is_chief = (TASK_INDEX == 0)
    with tf.device(tf.train.replica_device_setter(
            worker_device = "/job:worker/task:%d" % TASK_INDEX, cluster = cluster)):
        a = tf.constant(8)
        b = tf.constant(9)
    with tf.Session(server.target) as sess:
        sess.run(tf.multiply(a, b))
What I am finding by running python3 test.py --job_name ps --task_idx 0 on computer A is that both GPUs on computer A are immediately reserved by the script, while computer B shows no activity. This is not what I expected. I thought that since the ps job simply runs server.join(), it should not use the GPUs. However, I can see by setting pdb breakpoints that the GPUs are taken as soon as the server is initialized. This leaves me with several questions:
- Why does the server immediately take up all of the GPU capacity?
- How am I supposed to allocate the GPUs and start the different processes?
- Does my original plan even make sense? (I am still somewhat confused about tasks, clusters, servers, and so on...)
I have watched the Tensorflow Developer Summit 2017 video on distributed Tensorflow and I have also been looking around on Github and blogs. I have not been able to find a working code example using the latest or even relatively recent distributed tensorflow functions. Likewise, I notice that many questions on Stack Overflow are not answered, so I have read related questions but not any that resolve my questions. I would appreciate any guidance or recommendations about other resources. Thanks!
Answer
I found that the following works when invoking from the command line:
CUDA_VISIBLE_DEVICES="" python3 test.py --job_name ps --task_idx 0 --dir_name TEST
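The same variable can also pin each worker process to its own GPU. Below is a minimal sketch, assuming two GPUs per machine and the worker layout from the question (tasks 0 and 1 on computer A, tasks 2 and 3 on computer B); `gpu_for_worker` is a hypothetical helper, not part of the original script:

```python
import os

def gpu_for_worker(task_idx, gpus_per_machine=2):
    """Map a global worker task index to a local GPU id, assuming
    worker tasks are assigned to machines in contiguous blocks of
    `gpus_per_machine` (as in the worker_hosts list in the question)."""
    return task_idx % gpus_per_machine

# Pin this process to one GPU; this must happen before TensorFlow
# is imported, because TF grabs all visible GPUs at initialization.
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_for_worker(3))  # worker 3 -> local GPU 1
```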
Since I found this in a lot of code examples, it seems to be the standard way to control an individual server's access to GPU resources.
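An equivalent in-script approach (a sketch, not from the original answer) is to set CUDA_VISIBLE_DEVICES from Python before TensorFlow is first imported; once TensorFlow has initialized CUDA, later changes to the variable are ignored. `set_visible_gpus` is a hypothetical helper name:

```python
import os

def set_visible_gpus(job_name, gpus=''):
    """Restrict which GPUs CUDA exposes to this process.

    Must be called before `import tensorflow`: once TensorFlow has
    initialized CUDA, changes to CUDA_VISIBLE_DEVICES have no effect.
    ps tasks get no GPUs; workers get the ids passed in `gpus`.
    """
    os.environ['CUDA_VISIBLE_DEVICES'] = '' if job_name == 'ps' else gpus

set_visible_gpus('ps')           # ps task: hide every GPU
set_visible_gpus('worker', '0')  # a worker pinned to local GPU 0
# import tensorflow as tf        # import TF only after this point
```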