Distributed tensorflow parameter server and workers
Question
I was closely following the Imagenet distributed TF train example.
I am not able to understand how the data is distributed when this example is run on 2 different workers. In theory, different workers should see different parts of the data. Also, what part of the code tells the parameters to be placed on the parameter server? In the multi-gpu example, for instance, there is an explicit section for 'cpu:0'.
Answer
The different workers see different parts of the data by virtue of dequeuing mini-batches of images from a single queue of preprocessed images. To elaborate, in the distributed setup for training the Imagenet model, the input images are preprocessed by multiple threads, and the preprocessed images are stored in a single RandomShuffleQueue. You can look for tf.RandomShuffleQueue in this file to see how this is done. The multiple workers are organized as 'Inception towers', and each tower dequeues a mini-batch of images from the same queue, thus getting a different part of the input. The picture here answers the second part of your question.

Look for slim.variables.VariableDeviceChooser in this file. The logic there makes sure that Variable objects are assigned evenly to the workers that act as parameter servers. All the other workers, which do the actual training, fetch the variables at the beginning of a step and update them at the end of the step.
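The shared-queue mechanism above can be sketched in plain Python. The snippet below is a minimal stand-in, not the actual Inception code: it uses `queue.Queue` and integer example IDs in place of TensorFlow's `RandomShuffleQueue` and real preprocessed images, and the worker names are made up. It shows why towers that dequeue from the same queue each see a disjoint part of the input:

```python
import queue
import threading

# A single shared queue of preprocessed examples, standing in for the
# RandomShuffleQueue (plain ints here instead of preprocessed images).
preprocessed = queue.Queue()
for example_id in range(12):
    preprocessed.put(example_id)

BATCH_SIZE = 3
batches = {}  # worker name -> list of mini-batches it dequeued
lock = threading.Lock()

def worker(name):
    # Each "tower" dequeues whole mini-batches from the SAME queue,
    # so no two towers ever receive the same example.
    while True:
        batch = []
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(preprocessed.get_nowait())
            except queue.Empty:
                break
        if not batch:
            break
        with lock:
            batches.setdefault(name, []).append(batch)

threads = [threading.Thread(target=worker, args=(f"worker_{i}",))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every example is consumed exactly once across all workers combined.
seen = [x for bs in batches.values() for b in bs for x in b]
assert sorted(seen) == list(range(12))
print(batches)
```

The shuffling that `RandomShuffleQueue` adds on top of this is only about the *order* of dequeued examples; the disjointness of what each worker sees comes purely from consuming a single shared queue.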
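The variable-placement side can be sketched similarly. The class below is a simplified, hypothetical model of what a device chooser does: it pins each new variable to the next parameter-server task in turn, so parameters spread evenly across ps workers. The real slim chooser may also balance by parameter size, and the variable names here are invented for illustration:

```python
# Simplified round-robin placement policy, in the spirit of
# slim.variables.VariableDeviceChooser: each new Variable is pinned to
# the next /job:ps task, spreading parameters evenly across ps workers.
class RoundRobinDeviceChooser:
    def __init__(self, num_ps_tasks):
        self.num_ps_tasks = num_ps_tasks
        self.next_task = 0

    def __call__(self, var_name):
        # Assign the variable to the current ps task, then advance.
        device = f"/job:ps/task:{self.next_task}"
        self.next_task = (self.next_task + 1) % self.num_ps_tasks
        return device

chooser = RoundRobinDeviceChooser(num_ps_tasks=2)
placements = {name: chooser(name) for name in
              ["conv1/weights", "conv1/biases", "fc/weights", "fc/biases"]}
print(placements)
# The four variables alternate between /job:ps/task:0 and /job:ps/task:1.
```

The training workers never own these variables; each step they read the current values from the ps tasks, compute gradients on their own mini-batch, and send updates back, which is exactly the fetch-then-update cycle described above.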