Distributed tensorflow monopolizes GPUs after running server.__init__

Problem description

I have two computers with two GPUs each. I am trying to get started with distributed tensorflow and am very confused about how it all works. On computer A I would like to have one ps task (I have the impression this should go on the CPU) and two worker tasks (one per GPU). And I would like to have two 'worker' tasks on computer B. Here's how I have tried to implement this, in test.py:

import tensorflow as tf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job_name',   required = True,            type = str)
parser.add_argument('--task_idx',   required = True,            type = int)
args, _    = parser.parse_known_args()
JOB_NAME   = args.job_name
TASK_INDEX = args.task_idx


ps_hosts     = ["computerB-i9:2222"]
worker_hosts = ["computerA-i7:2222", "computerA-i7:2223", "computerB-i9:2223", "computerB-i9:2224"]
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server  = tf.train.Server(cluster, job_name = JOB_NAME, task_index = TASK_INDEX)

if JOB_NAME == "ps":
    server.join()
elif JOB_NAME == "worker":
    is_chief = (TASK_INDEX == 0)    

    with tf.device(tf.train.replica_device_setter(
            worker_device = "/job:worker/task:%d" % TASK_INDEX, cluster = cluster)):

        a = tf.constant(8)
        b = tf.constant(9)

    with tf.Session(server.target) as sess:
        sess.run(tf.multiply(a, b))

What I am finding by running python3 test.py --job_name ps --task_idx 0 on computer A is that both GPUs on computer A are immediately reserved by the script, while computer B shows no activity. This is not what I expected. I thought that since the ps job simply runs server.join(), it should not use the GPUs. However, by setting pdb breakpoints I can see that the GPUs are taken as soon as the server is initialized. This leaves me with several questions:

- Why does the server immediately take up all the GPU capacity?
- How should I allocate the GPUs and start the different processes?
- Does my original plan make sense? (I am still a bit confused about tasks, clusters, servers, etc.)

I have watched the Tensorflow Developer Summit 2017 video on distributed Tensorflow and I have also been looking around on Github and blogs. I have not been able to find a working code example using the latest, or even relatively recent, distributed tensorflow functions. Likewise, I notice that many questions on Stack Overflow are not answered, so I have read related questions but found none that resolve mine. I would appreciate any guidance or recommendations about other resources. Thanks!

Answer

I found that the following will work when invoking from the command line:

CUDA_VISIBLE_DEVICES="" python3 test.py --job_name ps --task_idx 0 --dir_name TEST

Since I found this in a lot of code examples, it seems to be the standard way to control an individual server's access to GPU resources.
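Setting the variable from inside the script works too, as long as it happens before TensorFlow is imported, because the CUDA runtime reads CUDA_VISIBLE_DEVICES when the library is loaded. Here is a minimal sketch of that approach; the TensorFlow lines are commented out and assume the TF 1.x API used in the question:

```python
import os

# Hide every GPU from this process. This must run before `import tensorflow`,
# since the CUDA runtime reads CUDA_VISIBLE_DEVICES at library load time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf
# server = tf.train.Server(cluster, job_name="ps", task_index=0)  # stays on CPU
# server.join()

# The same trick pins each worker process to a single GPU, e.g. set
# os.environ["CUDA_VISIBLE_DEVICES"] = "0" for worker 0 and "1" for worker 1.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> prints an empty line
```

In TF 1.x, `tf.train.Server` also accepts a `config` argument taking a `tf.ConfigProto`, so something like `config=tf.ConfigProto(device_count={"GPU": 0})` is sometimes used for the same purpose; the environment-variable approach above is the one that shows up most often in examples.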
