Tensorflow: Using Parameter Servers in Distributed Training


Question

It's not totally clear how parameter servers know what to do in distributed TensorFlow training.

For example, in this SO question, the following code is used to configure parameter server and worker tasks:

# `server` is the tf.train.Server created for this task's job_name/task_index.
if FLAGS.job_name == "ps":
    server.join()   # parameter server task: block and serve variable requests
elif FLAGS.job_name == "worker":
    # ...some training code...

How does server.join() indicate the given task should be a parameter server? Is parameter serving a kind of default behavior for tasks? Is there anything else you can/should tell a parameter serving task to do?

Edit: This SO question addresses some of my question: "The logic there makes sure that Variable objects are assigned evenly to workers that act as parameter servers." But how does a parameter server know it is a parameter server? Is server.join() enough?

Answer

TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op they will cause work to happen on the "/job:ps" devices, which will update the shared variables.
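
To make the cluster layout concrete, here is a minimal sketch (the addresses are made up, and FLAGS.job_name / FLAGS.task_index are assumed to come from command-line flags as in the question) of how every process describes the same cluster and creates its own tf.train.Server:

import tensorflow as tf

# Hypothetical two-PS / two-worker cluster; every process runs this same code
# with its own --job_name and --task_index flags.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224", "localhost:2225"],
})

# The server exposes this task's devices (e.g. "/job:ps/task:0") to the
# other processes in the cluster.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)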

The server.join() method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).

In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in ## some training code. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The with tf.device(tf.train.replica_device_setter(...)): block tells TensorFlow to put each variable on a different PS task by setting its device to "/job:ps/task:{i}" (for different values of {i}, chosen in a round-robin fashion).
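
For illustration, a sketch of that placement, assuming the two-PS cluster above and made-up variable shapes:

# Sketch: with two PS tasks, replica_device_setter pins each new variable to
# "/job:ps/task:0" or "/job:ps/task:1" in round-robin order, while ordinary
# ops stay on this worker's device.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=2,
        worker_device="/job:worker/task:%d" % FLAGS.task_index)):
    weights = tf.Variable(tf.zeros([784, 10]))  # placed on /job:ps/task:0
    biases = tf.Variable(tf.zeros([10]))        # placed on /job:ps/task:1

Only the worker processes build this graph; the PS processes stay passive and just serve the variables that get placed on their devices.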

When you call sess.run(train_op), TensorFlow will run a graph that depends on and updates the variables, and includes the operations that update them. This part of the computation will happen on the "/job:ps" devices, so those devices will act like a parameter server.
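
Putting the worker side together in a rough sketch (the loss below is just a stand-in for a real model's loss, and in practice the model and training op would typically be built inside the replica_device_setter block shown above):

# The session connects to this worker's in-process server; every run of
# train_op sends the variable updates to the "/job:ps" devices that hold
# `weights` and `biases` from the sketch above.
loss = tf.reduce_sum(tf.square(weights)) + tf.reduce_sum(tf.square(biases))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(train_op)   # variable updates happen on the PS devices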

