Distributed Tensorflow device placement in Google Cloud ML engine


Question

I am running a large distributed Tensorflow model in Google Cloud ML Engine, and I want to use machines with GPUs. My graph consists of two main parts: the input/data-reader function and the computation part.

I wish to place the variables in the PS task, the input part on the CPU, and the computation part on the GPU. The function tf.train.replica_device_setter automatically places variables on the PS server.

This is what my code looks like:

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    input_tensors = model.input_fn(...)
    output_tensors = model.model_fn(input_tensors, ...)

Is it possible to use tf.device() together with replica_device_setter(), as in:

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    with tf.device('/cpu:0'):
        input_tensors = model.input_fn(...)
    with tf.device('/gpu:0'):
        tensor_dict = model.model_fn(input_tensors, ...)

Will the replica_device_setter() be overridden, leaving the variables not placed on the PS server?

Furthermore, since the device names in the cluster are something like job:master/replica:0/task:0/gpu:0, how do I say to Tensorflow tf.device(whatever/gpu:0)?

Answer

Any operations, beyond variables, in the tf.train.replica_device_setter block are automatically pinned to "/job:worker", which defaults to the first device managed by the first task in the "worker" job.

You can pin them to another device (or task) by using an embedded device block:

with tf.device(tf.train.replica_device_setter(ps_tasks=2, ps_device="/job:ps",
                                              worker_device="/job:worker")):
  v1 = tf.Variable(1., name="v1")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
  v2 = tf.Variable(2., name="v2")  # pinned to /job:ps/task:1 (defaults to /cpu:0)
  v3 = tf.Variable(3., name="v3")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
  s = v1 + v2            # pinned to /job:worker (defaults to task:0/cpu:0)
  with tf.device("/task:1"):
    p1 = 2 * s           # pinned to /job:worker/task:1 (defaults to /cpu:0)
    with tf.device("/cpu:0"):
      p2 = 3 * s         # pinned to /job:worker/task:1/cpu:0
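
This also answers the second question: you do not need to spell out the full job:master/replica:0/task:0/gpu:0 name. A partial spec such as "/gpu:0" is merged with whatever the device setter assigns, field by field. Below is a minimal sketch of that merging (an assumption-laden illustration, not code from the question: the ps0/worker0 host names are placeholders, and tf.zeros/tf.random_normal stand in for the real input_fn/model_fn):

import tensorflow as tf

# Hypothetical two-node cluster; host names are placeholders.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222"],
})

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    w = tf.Variable(tf.zeros([10, 10]), name="w")  # -> /job:ps/task:0
    with tf.device("/cpu:0"):
        x = tf.random_normal([10, 10])             # -> /job:worker/cpu:0
    with tf.device("/gpu:0"):
        y = tf.matmul(x, w)                        # -> /job:worker/gpu:0

# Device strings are resolved at graph-construction time, so they can be
# inspected without a running cluster:
print(w.device)  # the setter still wins for variables
print(y.device)  # partial "/gpu:0" merged with the setter's worker_device

One caveat: a variable created inside the "/gpu:0" block would have the GPU constraint merged into its PS device as well (e.g. /job:ps/task:0/gpu:0), which can fail if the PS machines have no GPU. Creating variables outside the GPU block, as above, avoids this.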

