What is the reason to use parameter server in distributed tensorflow learning?


Question

Short version: can't we store variables in one of the workers and not use parameter servers?

Long version: I want to implement synchronous distributed learning of a neural network in TensorFlow. I want each worker to have a full copy of the model during training.

I've read the distributed TensorFlow tutorial and the code for distributed ImageNet training, and I didn't get why we need parameter servers.

I see that they are used for storing the values of variables, and that replica_device_setter takes care that variables are evenly distributed across the parameter servers (it probably does something more; I wasn't able to fully understand the code).

The question is: why don't we use one of the workers to store variables? Will I achieve that if I use

with tf.device('/job:worker/task:0/cpu:0'):

instead of

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):

for the Variables? If that works, is there a downside compared to the solution with parameter servers?
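
For concreteness, here is a minimal sketch of the two placements being compared; the cluster spec, host names, and variable shapes are made up for illustration, following the TF1-style API of the snippets above:

import tensorflow as tf

# Hypothetical cluster; host names are placeholders.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})

# (a) Pin every variable to worker 0's CPU, as the question proposes.
with tf.device('/job:worker/task:0/cpu:0'):
    w = tf.Variable(tf.zeros([784, 10]), name="w")

# (b) Let replica_device_setter place variables on the ps job instead.
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    b = tf.Variable(tf.zeros([10]), name="b")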

Answer

Using a parameter server can give you better network utilization, and it lets you scale your models to more machines.

A concrete example: suppose you have 250M parameters, it takes 1 second to compute the gradient on each worker, and there are 10 workers. This means each worker has to send/receive 1 GB of data to/from the 9 other workers every second, which would need 72 Gbps of full-duplex network capacity on each worker; that is not practical.
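
Spelling out the arithmetic (this assumes 4-byte float32 parameters, which is what the 1 GB figure implies):

# 250M float32 parameters = 1 GB of gradients per step.
params = 250e6
payload_gb = params * 4 / 1e9        # 1.0 GB per worker per step
peers = 9                            # each of the 10 workers talks to 9 others
print(payload_gb * peers * 8)        # 72.0 Gbps full duplex, per worker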

More realistically, you could have 10 Gbps of network capacity per worker. You prevent network bottlenecks by splitting the parameter server over 8 machines. Each worker machine then communicates with each parameter-server machine for only 1/8th of the parameters.
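
A sketch of that setup, with hypothetical host names. By default replica_device_setter assigns variables to ps tasks in round-robin order, which is what spreads the traffic (it balances variable counts rather than bytes, so the 1/8 split is approximate):

import tensorflow as tf

# Hypothetical cluster: 10 workers, 8 parameter servers.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker%d.example.com:2222" % i for i in range(10)],
    "ps": ["ps%d.example.com:2222" % i for i in range(8)],
})

# Each new variable goes to the next ps task in round-robin order, so each
# worker exchanges only ~1/8 of the 1 GB with any single ps machine.
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    w1 = tf.Variable(tf.zeros([784, 256]))   # -> /job:ps/task:0
    w2 = tf.Variable(tf.zeros([256, 10]))    # -> /job:ps/task:1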
