Distributed tensorflow monopolizes GPUs after running server.__init__
Problem description
I have two computers with two GPUs each. I am trying to get started with distributed tensorflow and am very confused about how it all works. On computer A I would like to have one ps task (I have the impression this should go on the CPU) and two worker tasks (one per GPU). And I would like to have two worker tasks on computer B. Here is how I have tried to implement this, in test.py:
import tensorflow as tf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job_name', required = True, type = str)
parser.add_argument('--task_idx', required = True, type = int)
args, _ = parser.parse_known_args()

JOB_NAME = args.job_name
TASK_INDEX = args.task_idx

ps_hosts = ["computerB-i9:2222"]
worker_hosts = ["computerA-i7:2222", "computerA-i7:2223", "computerB-i9:2223", "computerB-i9:2224"]

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name = JOB_NAME, task_index = TASK_INDEX)

if JOB_NAME == "ps":
    server.join()
elif JOB_NAME == "worker":
    is_chief = (TASK_INDEX == 0)
    with tf.device(tf.train.replica_device_setter(
            worker_device = "/job:worker/task:%d" % TASK_INDEX, cluster = cluster)):
        a = tf.constant(8)
        b = tf.constant(9)
    with tf.Session(server.target) as sess:
        sess.run(tf.multiply(a, b))
What I am finding by running python3 test.py --job_name ps --task_idx 0 on computer A is that both GPUs on computer A are immediately reserved by the script, while computer B shows no activity. This is not what I expected. I thought that since the ps job simply runs server.join(), it should not use the GPUs. However, I can see by setting pdb breakpoints that the GPUs are taken as soon as the server is initialized. This leaves me with several questions:
- Why does the server immediately take up all of the GPU capacity?
- How am I supposed to allocate the GPUs and start the different processes?
- Does my original plan even make sense? (I am still somewhat confused about tasks, clusters, servers, and so on...)
I have watched the Tensorflow Developer Summit 2017 video on distributed Tensorflow and I have also been looking around on Github and blogs. I have not been able to find a working code example using the latest or even relatively recent distributed tensorflow functions. Likewise, I notice that many questions on Stack Overflow are not answered, so I have read related questions but not any that resolve my questions. I would appreciate any guidance or recommendations about other resources. Thanks!
Answer
I found that the following works when invoking from the command line:
CUDA_VISIBLE_DEVICES="" python3 test.py --job_name ps --task_idx 0 --dir_name TEST
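The same variable can also pin each worker process to its own GPU. Below is a minimal sketch, assuming two GPUs per machine and the worker layout from the question (tasks 0 and 1 on computer A, tasks 2 and 3 on computer B); `gpu_for_worker` is a hypothetical helper, not part of the original script:

```python
import os

def gpu_for_worker(task_idx, gpus_per_machine=2):
    """Map a global worker task index to a local GPU id, assuming
    worker tasks are assigned to machines in contiguous blocks of
    `gpus_per_machine` (as in the worker_hosts list in the question)."""
    return task_idx % gpus_per_machine

# Pin this process to one GPU; this must happen before TensorFlow
# is imported, because TF grabs all visible GPUs at initialization.
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_for_worker(3))  # worker 3 -> local GPU 1
```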
Since I found this in a lot of code examples, it seems to be the standard way to control an individual server's access to GPU resources.
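An equivalent in-script approach (a sketch, not from the original answer) is to set CUDA_VISIBLE_DEVICES from Python before TensorFlow is first imported; once TensorFlow has initialized CUDA, later changes to the variable are ignored. `set_visible_gpus` is a hypothetical helper name:

```python
import os

def set_visible_gpus(job_name, gpus=''):
    """Restrict which GPUs CUDA exposes to this process.

    Must be called before `import tensorflow`: once TensorFlow has
    initialized CUDA, changes to CUDA_VISIBLE_DEVICES have no effect.
    ps tasks get no GPUs; workers get the ids passed in `gpus`.
    """
    os.environ['CUDA_VISIBLE_DEVICES'] = '' if job_name == 'ps' else gpus

set_visible_gpus('ps')           # ps task: hide every GPU
set_visible_gpus('worker', '0')  # a worker pinned to local GPU 0
# import tensorflow as tf        # import TF only after this point
```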