How does asynchronous training work in distributed Tensorflow?

Question

I've read the Distributed TensorFlow doc, and it mentions that in asynchronous training,

each replica of the graph has an independent training loop that executes without coordination.

From what I understand, if we use a parameter-server architecture with data parallelism, each worker computes gradients and updates its own weights without caring about other workers' updates while training the neural network. Since all weights are shared on the parameter server (ps), I think the ps still has to coordinate (or aggregate) the weight updates from all workers in some way. I wonder how this aggregation works in asynchronous training. Or, more generally, how does asynchronous training work in distributed TensorFlow?
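
For concreteness, this is roughly the setup I have in mind (TF 1.x, between-graph replication); the cluster addresses, task index, and toy model below are just placeholders I made up:

```python
import numpy as np
import tensorflow as tf

# Placeholder cluster spec: one PS task and two workers (addresses are made up).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places the variables on the PS task and the ops on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.get_variable("w", [10, 1], initializer=tf.zeros_initializer())
    b = tf.get_variable("b", [1], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Each worker runs this loop independently -- no coordination between workers.
hooks = [tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(master=server.target, hooks=hooks) as sess:
    while not sess.should_stop():
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```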

Answer

When you train asynchronously in Distributed TensorFlow, a particular worker does the following:

  1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular, the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

  2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

  3. The worker sends the gradients for each variable to the appropriate PS task, and they are applied to their respective variables using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently to the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received (see the sketch after this list).
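
To make steps 2 and 3 concrete, here is a rough sketch of a single worker's training op split into its two phases (this is the split that optimizer.minimize() performs internally); the toy placeholder, variable, and loss are chosen purely for illustration:

```python
import tensorflow as tf

# Toy stand-ins for the model parameters and loss (illustrative names only).
x = tf.placeholder(tf.float32, [None, 10])
w = tf.get_variable("w", [10, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# Step 2: gradients are computed locally on the worker, from whatever
# parameter values it read from the PS in step 1 (lock-free, uncoordinated).
grads_and_vars = optimizer.compute_gradients(loss)

# Step 3: each gradient is sent to the PS task that owns its variable and
# applied there with the optimizer's update rule (plain SGD here). By default
# no lock is taken, so updates from different workers can interleave.
train_op = optimizer.apply_gradients(grads_and_vars)
```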

In asynchronous training, the updates from the workers are applied concurrently, and the updates may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
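
A minimal sketch of setting that flag; only the use_locking argument comes from the point above, the surrounding toy model is an assumption for illustration:

```python
import tensorflow as tf

w = tf.get_variable("w", [10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

# use_locking=True makes each variable update acquire a lock, so two workers
# applying an update to the same variable at the same time are serialized.
# Reads of the variable by other workers still do not take this lock.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1, use_locking=True)
train_op = opt.minimize(loss)
```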

(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter, and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update and leave after the aggregated update has been applied to all variables.)
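
For comparison, a rough sketch of that synchronous setup; the replica count and the toy model are assumptions for illustration, only the SyncReplicasOptimizer wrapping comes from the paragraph above:

```python
import tensorflow as tf

num_workers = 2  # assumed total number of worker replicas

w = tf.get_variable("w", [10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))
global_step = tf.train.get_or_create_global_step()

# Wrap a plain optimizer; gradients from `replicas_to_aggregate` workers are
# accumulated and applied as a single update per synchronous step.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.1),
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# This hook implements the barrier: workers block until the aggregated
# update for the current step has been applied to the variables.
sync_hook = opt.make_session_run_hook(is_chief=True)
```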
