How does asynchronous training work in distributed Tensorflow?


Problem description

I've read the Distributed TensorFlow documentation, and it mentions that in asynchronous training,

each replica of the graph has an independent training loop that executes without coordination.

From what I understand, if we use a parameter-server architecture with data parallelism, each worker computes gradients and updates its own weights without caring about the other workers' updates to the distributed neural network being trained. Since all weights are shared on the parameter server (ps), I think the ps still has to coordinate (or aggregate) the weight updates from all workers in some way. I wonder how the aggregation works in asynchronous training. Or, in more general terms, how does asynchronous training work in distributed Tensorflow?
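
For concreteness, the parameter-server / data-parallel setup the question refers to is usually described by a cluster spec. A minimal sketch using the TF 1.x API; the job sizes and host:port addresses below are placeholders:

```python
import tensorflow as tf  # TensorFlow 1.x

# Hypothetical cluster: two parameter servers and two workers.
# The host:port addresses are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job/task; e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```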

Solution

When you train asynchronously in Distributed TensorFlow, a particular worker does the following:

  1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

  2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

  3. The worker sends the gradients for each variable to the appropriate PS task, and applies them to their respective variables, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received (see the sketch after this list).
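
Put together, the three steps above are what one training step executes on a single worker under between-graph replication. A minimal sketch, assuming the `cluster` and `server` objects from the snippet in the question section, a hypothetical `build_loss()` that constructs the model, and worker task 0:

```python
import tensorflow as tf  # TensorFlow 1.x

# Sketch of one worker in asynchronous between-graph replication.
# `cluster` and `server` are assumed from the earlier snippet;
# `build_loss()` is a hypothetical function that builds the model loss.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0",   # this worker's device
        cluster=cluster)):
    # Variables are placed round-robin on the PS tasks; other ops stay here.
    loss = build_loss()
    global_step = tf.train.get_or_create_global_step()

    # minimize() builds ops that read the shared parameters (step 1),
    # compute gradients locally (step 2), and apply them to the
    # variables living on the PS tasks (step 3).
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Each worker runs its own loop; there is no cross-worker coordination.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```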

In asynchronous training, the updates from the workers are applied concurrently, and they may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
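
For reference, use_locking is a constructor argument of the TF 1.x optimizers. A minimal sketch, reusing the hypothetical loss from the sketch above:

```python
import tensorflow as tf  # TensorFlow 1.x

# With use_locking=True, the ops that apply this optimizer's updates take
# a lock on each variable, so two concurrent updates to the same variable
# are mutually exclusive. Reads of the variables still take no lock.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01, use_locking=True)
train_op = opt.minimize(loss)  # `loss` as in the earlier sketch
```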

(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)
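
For comparison, here is a minimal sketch of the synchronous variant; the replica counts are placeholders, and loss, global_step and server are assumed from the sketches above:

```python
import tensorflow as tf  # TensorFlow 1.x

# Wrap a plain optimizer so that gradients from `replicas_to_aggregate`
# workers are accumulated and averaged before one update is applied.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.01),
    replicas_to_aggregate=2,   # placeholder: number of workers to wait for
    total_num_replicas=2)      # placeholder: total number of workers
train_op = opt.minimize(loss, global_step=global_step)

# The hook sets up the queues/accumulators that implement the barrier
# described above; the chief worker also initializes them.
sync_hook = opt.make_session_run_hook(is_chief=True)
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True,
                                       hooks=[sync_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```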
