How to make generated data in a remote worker span iterations with in-graph replication in distributed TensorFlow?


Problem Description


I use the in-graph replication of TensorFlow to do distributed training. To reduce communication cost, I need to hold some generated data (such as the cell states of an LSTM) on some remote workers from one training iteration to the next, but I found that I cannot achieve it.

If I use the fetch mechanism of the 'session.run' interface to retrieve the data generated on one remote worker, and then feed that data back to the same remote worker in the next training iteration, unnecessary network traffic is produced, as the code below shows:

cluster = tf.train.ClusterSpec({"worker": ["remoteIP0:port", "remoteIP1:port"]})
...

for i in xrange(2):
  with tf.device("/job:worker/task:%d" % i):
    with tf.name_scope('%s_%d' % (TOWER_NAME, i)) as scope:
      # Build the model replica (tower) and one training step on worker i.
      ...
      initial_state[i] = ...
      ...
      weight[i] = ...
      bias[i] = ...
      cost[i] = ...
      ...
      gradient[i] = ...
      final_state[i] = ...
      ...

grad = aggregate_func(gradient[0], gradient[1])
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(grad)

...
with tf.Session("grpc://localhost:port") as session:
  ...
  for k in xrange(max_step):
    # Every iteration fetches final_state back to the master and feeds it
    # out again as initial_state -- two unnecessary network transfers.
    cost_val, finalstate, _ = session.run(
        [cost, final_state, train_op],
        feed_dict={initial_state[0]: finalstate[0],
                   initial_state[1]: finalstate[1]})
  ...

The 'final_state[i]' generated in iteration k needs to be assigned to 'initial_state[i]' in iteration k+1 for every remote worker. How can we do this assignment on the remote worker machine, without fetching the data to the master (grpc://localhost:port) machine and feeding it back out to the remote workers?

Solution

Variable objects and persistent tensors can replace feed_dict, as Yaroslav proposed. Thanks Yaroslav.
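Below is a minimal sketch of the Variable-based approach, assuming the TF 1.x API used in the question: each tower's recurrent state lives in a non-trainable variable pinned to its worker and is updated with an in-place assign, so it never travels through the master. The sizes (batch_size, state_size), the learning rate, max_step, and the tiny stand-in "tower" are hypothetical placeholders for the model code elided above.

import tensorflow as tf

batch_size, state_size, lr, max_step = 32, 128, 0.1, 1000   # hypothetical values
cluster = tf.train.ClusterSpec({"worker": ["remoteIP0:port", "remoteIP1:port"]})

costs, states, new_states = [], [], []
for i in range(2):
  with tf.device("/job:worker/task:%d" % i):
    # Non-trainable variable pinned to worker i; it persists between
    # session.run calls, so the state is never fetched or fed.
    h = tf.get_variable("state_%d" % i, [batch_size, state_size],
                        initializer=tf.zeros_initializer(), trainable=False)
    # Stand-in for the elided tower: one trainable weight and a fake input.
    w = tf.get_variable("w_%d" % i, [state_size, state_size])
    x = tf.random_normal([batch_size, state_size])
    new_h = tf.tanh(tf.matmul(h, w) + x)          # this step's "final state"
    costs.append(tf.reduce_mean(tf.square(new_h)))
    states.append(h)
    new_states.append(new_h)

total_cost = tf.add_n(costs)
opt_step = tf.train.GradientDescentOptimizer(lr).minimize(total_cost)

# Write the new states back in place only after the gradient step has used
# the old ones; each assign runs on the worker that owns its variable.
with tf.control_dependencies([opt_step]):
  train_op = tf.group(*[tf.assign(s, ns) for s, ns in zip(states, new_states)])

with tf.Session("grpc://localhost:port") as session:
  session.run(tf.global_variables_initializer())
  for step in range(max_step):
    # Only the scalar cost crosses the network; the state stays on the workers.
    _, cost_val = session.run([train_op, total_cost])

With this layout, session.run only moves scalars between the master and the workers. The "persistent tensors" route (tf.get_session_handle / tf.get_session_tensor in the TF 1.x API) is an alternative: a fetched tensor stays inside the runtime and only a small handle is passed back through feed_dict on the next step.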
