Distributed TensorFlow [Async, Between-Graph Replication]: what exactly is the interaction between workers and servers regarding Variable updates?


Problem description


I've read the Distributed TensorFlow Doc and this question on StackOverflow, but I still have some doubts about the dynamics behind the distributed training that can be done with TensorFlow and its Parameter Server architecture. This is a snippet of code from the Distributed TensorFlow Doc:

if FLAGS.job_name == "ps":
  server.join()
elif FLAGS.job_name == "worker":

  # Assigns ops to the local worker by default.
  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % FLAGS.task_index,
      cluster=cluster)):

    # Build model...
    loss = ...
    global_step = tf.contrib.framework.get_or_create_global_step()

    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        loss, global_step=global_step)
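
For context, the snippet above assumes that cluster and server have already been created earlier in the script. A minimal sketch of that setup, following the pattern from the same distributed TensorFlow guide (the host:port addresses below are placeholders, and FLAGS is the same flags object used in the snippet):

import tensorflow as tf

# Describe the cluster: which jobs/tasks exist and where they listen.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]
})

# Each process starts a server for its own job/task. The PS process then calls
# server.join(), while each worker process builds the graph shown above.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)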


And here is part of the answer to the StackOverflow question that I read:



1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).


2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.



3. The worker sends the gradients for each variable to the appropriate PS task, which applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.


I have to reproduce this kind of parameter server architecture in another environment and I need to understand in depth how workers and PS tasks interact with each other inside the TensorFlow framework. My question is: does the PS task do some kind of merging or updating operation after receiving the values from the workers, or does it just store the newest value? Is just storing the newest value something reasonable? Looking at the code from the TensorFlow documentation I see that the PS task just does a join(), and I wonder what the complete behaviour of the PS task is behind this method call.


One more question: what is the difference between computing a gradient and applying a gradient?

Recommended answer


Let's go in reverse order and start from your last question: what is the difference between computing a gradient and applying a gradient?


Computing the gradients means running the backward pass on the network, after having computed the loss. For gradient descent, this means estimating the gradients value in the formula beneath (note: this is a huge simplification of what computing gradients actually entails; look up backpropagation and gradient descent for a proper explanation of how this works). Applying the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:

weights = weights - (learning_step * gradients)
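
In TensorFlow 1.x terms, minimize() is simply the composition of these two phases. A short sketch, reusing the loss and global_step from the question's snippet:

optimizer = tf.train.AdagradOptimizer(0.01)

# Phase 1: compute the gradients (the backward pass); returns (gradient, variable) pairs.
grads_and_vars = optimizer.compute_gradients(loss)

# Phase 2: apply the gradients, i.e. run the optimizer's update rule on each variable.
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

# optimizer.minimize(loss, global_step=global_step) is equivalent to the two calls above.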


Note that, depending on the value of learning_step, the new value of weights depends on both its previous value and the computed gradients.


With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend to multiple PSs).


A PS (parameter server) keeps in memory the weights (i.e. the parameters) and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.
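
If you need to reproduce this behaviour in another environment, the PS side is conceptually not much more than the following toy sketch (plain Python/NumPy; all the names here are made up for illustration, and the real TensorFlow runtime implements this inside its variable and optimizer ops):

import numpy as np

class ToyParameterServer(object):
    """Keeps the weights in memory and applies whatever gradients it receives."""

    def __init__(self, initial_weights, learning_step=0.01):
        self.weights = {name: w.copy() for name, w in initial_weights.items()}
        self.learning_step = learning_step

    def get_weights(self):
        # Served to workers on request; note that no lock is taken, so a reader
        # may observe some variables already updated and others not yet.
        return {name: w.copy() for name, w in self.weights.items()}

    def apply_gradients(self, gradients):
        # Plain SGD update rule, run once per gradient message from any worker.
        for name, grad in gradients.items():
            self.weights[name] -= self.learning_step * grad

So, to answer your question: the PS does not merge anything, and it does not simply store the newest value either; it applies each incoming gradient to the current state of the variables.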


A worker, on the other hand, looks up the current value of weights on the PS, makes a copy of it locally, runs a forward and a backward pass of the network on a batch of data to get new gradients, and then sends them back to the PS.


Note the emphasis on "current": there is no locking or inter-process synchronization between the workers and the PS. If a worker reads weights in the middle of an update (for example, half of them already have the new value and half are still being updated), those are the weights it will use for the next iteration. This keeps things fast.
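
This matches TensorFlow's own defaults: the built-in optimizers take a use_locking argument that defaults to False, so the apply step does not serialize access to the variables it updates. For example, the optimizer from the question's snippet is effectively:

# use_locking defaults to False: concurrent reads and updates of the variables
# are not serialized, which is the "no locks" behaviour described above.
optimizer = tf.train.AdagradOptimizer(0.01, use_locking=False)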


What if there are more PSs? No problem! The parameters of the network are partitioned among the PSs, and the worker simply contacts all of them to get the new values of each chunk of the parameters and sends back only the gradients relevant to each specific PS.
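
In the snippet from the question, this partitioning is exactly what tf.train.replica_device_setter arranges: when the cluster has more than one PS task, the variables created under it are spread across the PS tasks (round-robin by default), while the compute ops stay on the worker. A small sketch with placeholder addresses; the exact task each variable lands on is just illustrative:

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],   # two PS tasks
    "worker": ["worker0.example.com:2222"]
})

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)):
  # Variables created here are assigned alternately to /job:ps/task:0 and
  # /job:ps/task:1; each worker fetches both chunks and sends each PS only
  # the gradients for the variables it owns.
  w1 = tf.Variable(tf.zeros([10]))   # e.g. placed on /job:ps/task:0
  w2 = tf.Variable(tf.zeros([10]))   # e.g. placed on /job:ps/task:1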
