Does tf.train.SyncReplicasOptimizer perform the complete parameter update from aggregated gradients to values multiple times?


Problem Description

In /model/inception/inception/inception_distributed_training.py, apply_gradients is called for each worker:

apply_gradients_op = opt.apply_gradients(grads, global_step=global_step) 

and goes into SyncReplicasOptimizer.py:

  285       # sync_op will be assigned to the same device as the global step.
  286       with ops.device(global_step.device), ops.name_scope(""):
  287         update_op = self._opt.apply_gradients(aggregated_grads_and_vars,
  288                                               global_step)
  289

Line 287 will be executed by each worker process on the ps device.

I think that even though the job that aggregates all replicas' gradients runs only once, once aggregation finishes every replica will issue an RPC call to the remote apply_gradients operation group to produce the next variable value. If that is the case, the duplicated apply_gradients could be eliminated by checking the is_chief flag.
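
For context, here is a minimal sketch of the kind of setup the question describes, using the TensorFlow 1.x API. The toy model, worker count, and task id below are illustrative placeholders, not values taken from inception_distributed_training.py.

```python
import tensorflow as tf

# Illustrative placeholders, not values from inception_distributed_training.py.
num_workers = 2
task_id = 0
is_chief = (task_id == 0)  # the question's is_chief flag: task 0 is conventionally the chief

# A toy model standing in for the Inception tower.
x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.get_variable("w", shape=[1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(x * w - 1.0))
global_step = tf.train.get_or_create_global_step()

base_opt = tf.train.GradientDescentOptimizer(0.01)
opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=num_workers,  # gradients are aggregated across workers
    total_num_replicas=num_workers)

grads = opt.compute_gradients(loss)
# Every worker builds this op -- this is the call the question asks about.
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
```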

Two questions, by the way:


  • How is exclusive access to the variable buffer controlled if multiple update operations arrive?

  • Can we use the caching_device flag to eliminate multiple remote variable accesses (i.e. multiple network round trips)? If so, how do we trigger an update (invalidation) of the cached variable when the variables on the ps are updated? (See the sketch after this list.)
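
Regarding the second question, here is a minimal sketch (TensorFlow 1.x) of how caching_device is typically set through a variable scope; the job/task device strings are illustrative. To the best of my understanding, the cache does not need explicit invalidation: the cached read is an ordinary op that is re-evaluated on each session.run, so each step fetches the current value from the ps once and reuses it within that step.

```python
import tensorflow as tf

# Variables live on the ps, but reads are cached on the worker device.
# Device strings below are illustrative.
with tf.device("/job:ps/task:0"):
    with tf.variable_scope("model", caching_device="/job:worker/task:0"):
        w = tf.get_variable("w", shape=[10],
                            initializer=tf.zeros_initializer())

# This read is served from the worker-side cached copy of w; the copy is a
# regular op, so it is re-evaluated (i.e. refetched from the ps) on each
# session.run rather than being explicitly invalidated.
y = w * 2.0
```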

I have carefully read many documents and run many experiments to verify this, but an official answer would still be highly appreciated.

Recommended Answer

After carefully reviewing these code snippets, I would like to answer it myself.

```

  310       with ops.device(global_step.device), ops.name_scope(""):
  311         # Replicas have to wait until they can get a token from the token queue.
  312         with ops.control_dependencies(train_ops):
  313           token = sync_token_queue.dequeue()
  314         train_op = state_ops.assign(self._local_step, token)
  315
  316         with ops.control_dependencies([update_op]):
  317           # Sync_op needs to insert tokens to the token queue at the end of the
  318           # step so the replicas can fetch them to start the next step.
  319           tokens = array_ops.fill([self._tokens_per_step], global_step)
  320           sync_op = sync_token_queue.enqueue_many((tokens,))
  321
  322         if self._variable_averages is not None:
  323           with ops.control_dependencies([sync_op]), ops.name_scope(""):
  324             sync_op = self._variable_averages.apply(
  325                 self._variables_to_average)
  326
  327         self._chief_queue_runner = queue_runner.QueueRunner(dummy_queue,
  328                                                             [sync_op])

```

There are two op sets here: train_ops and update_op. update_op ends in sync_op, and sync_op is executed by the QueueRunner returned as self._chief_queue_runner. train_ops ends in train_op, which is called in the context of each worker.

As a brief conclusion, sync_op is returned for the chief worker to drive the parameter update (the update itself is done on the ps; in reality the chief worker only controls the synchronization mechanism), while train_op is called by each worker.

That is, the update action runs only once; there is no duplicated update.
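
To illustrate this conclusion, here is a sketch of how the driver code typically wires this up (TensorFlow 1.x, Supervisor style as in the Inception example), continuing the hypothetical opt / apply_gradients_op / is_chief / global_step from the sketch in the question above; the master address and log directory are placeholders. Only the chief starts the queue runner that executes sync_op, so the wrapped optimizer's update is applied exactly once per step.

```python
# Continues the hypothetical setup sketched in the question above.
chief_queue_runner = opt.get_chief_queue_runner()
init_tokens_op = opt.get_init_tokens_op()

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir="/tmp/train_logs",        # placeholder
                         global_step=global_step,
                         init_op=tf.global_variables_initializer())

with sv.managed_session("grpc://localhost:2222") as sess:  # placeholder master
    if is_chief:
        # Only the chief runs sync_op (via this queue runner) and seeds the
        # token queue, so the parameter update happens once per step.
        sv.start_queue_runners(sess, [chief_queue_runner])
        sess.run(init_tokens_op)
    while not sv.should_stop():
        # Every worker runs train_op: it blocks on the token queue until the
        # chief's sync_op has applied the aggregated update.
        sess.run(apply_gradients_op, feed_dict={x: [[1.0]]})
```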

That's all.
