Training Multi-GPU on Tensorflow: a simpler way?

Question

I have been using the training method from the cifar10_multi_gpu_train example for (local) multi-GPU training, i.e., creating several towers and then averaging the gradients. However, I was wondering the following: what happens if I just take the losses coming from the different GPUs, sum them up, and then apply gradient descent to that new loss?
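
For concreteness, below is a minimal sketch of that summed-loss variant, written against the TF 1.x graph API used by cifar10_multi_gpu_train; the toy `tower_loss`, the fake batches, and the two-GPU device strings are illustrative assumptions, not code from the example.

```python
import tensorflow as tf  # TF 1.x graph mode, as in cifar10_multi_gpu_train

# Toy stand-in for the per-tower model: one shared weight and a squared error.
w = tf.get_variable("w", initializer=1.0)

def tower_loss(x, y):
    return tf.square(w * x - y)

batches = [(2.0, 1.0), (4.0, 3.0)]        # fake per-GPU batches

tower_losses = []
for i, (x, y) in enumerate(batches):
    with tf.device("/gpu:%d" % i):        # assumes two local GPUs
        tower_losses.append(tower_loss(x, y))

# The "new loss" from the question: sum the tower losses and differentiate once.
summed_loss = tf.add_n(tower_losses)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(summed_loss)
```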

Would that work? This is probably a silly question, and there must be a limitation somewhere, so I would be happy if you could comment on this.

Thanks and best regards, G.

Answer

It would not work with the sum. You would get a bigger loss and, consequently, bigger and probably erroneous gradients. When you average the gradients you get the average of the directions the weights have to take in order to minimize the loss, but each individual direction is the one computed for its exact loss value.
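
Reusing the same toy loss as in the question's sketch, the snippet below shows, in condensed form and with device placement omitted, the per-tower gradient averaging that cifar10_multi_gpu_train performs, and compares it against the summed-loss gradient; the factor-of-`len(batches)` difference in magnitude is the scale issue described above.

```python
import tensorflow as tf  # TF 1.x graph mode

w = tf.get_variable("w", initializer=1.0)
batches = [(2.0, 1.0), (4.0, 3.0)]                       # fake per-GPU batches
opt = tf.train.GradientDescentOptimizer(0.1)

# One gradient list per tower, each a list of (gradient, variable) pairs.
tower_grads = [opt.compute_gradients(tf.square(w * x - y)) for x, y in batches]

# Average the gradients variable by variable, as the cifar10 example does.
averaged = []
for grads_and_var in zip(*tower_grads):
    grads = [g for g, _ in grads_and_var]
    averaged.append((tf.add_n(grads) / len(grads), grads_and_var[0][1]))

train_op = opt.apply_gradients(averaged)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    avg_g = sess.run(averaged[0][0])
    sum_g = sess.run(tf.add_n([tg[0][0] for tg in tower_grads]))
    print(avg_g, sum_g)   # the summed-loss gradient is len(batches) times larger (here 6.0 vs 12.0)
```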

One thing you can try is to run the towers independently and then average the weights from time to time: you get a slower convergence rate but faster processing on each node. A sketch of this scheme follows below.
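
Here is a rough sketch of that alternative under the same toy setup: each tower owns its own copy of the weight, trains independently, and every few steps the copies are reset to their mean. The variable names, the `AVERAGE_EVERY` interval, and the CPU-only placement are illustrative assumptions, not part of the original answer.

```python
import tensorflow as tf  # TF 1.x graph mode

batches = [(2.0, 1.0), (4.0, 3.0)]                        # fake per-GPU batches
tower_ws, train_ops = [], []
for i, (x, y) in enumerate(batches):
    # In real use each tower would also sit under its own tf.device("/gpu:<i>").
    w_i = tf.get_variable("w_tower_%d" % i, initializer=1.0)
    tower_ws.append(w_i)
    loss_i = tf.square(w_i * x - y)
    train_ops.append(
        tf.train.GradientDescentOptimizer(0.1).minimize(loss_i, var_list=[w_i]))

# Periodically pull every copy back to the mean of all copies.
mean_w = tf.add_n(tower_ws) / len(tower_ws)
sync_op = tf.group(*[w_i.assign(mean_w) for w_i in tower_ws])

AVERAGE_EVERY = 10                                         # hypothetical sync interval
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(train_ops)                                # towers step independently
        if (step + 1) % AVERAGE_EVERY == 0:
            sess.run(sync_op)
    print(sess.run(tower_ws))                              # copies agree after the last sync
```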
