Is using batch size as 'powers of 2' faster on TensorFlow?


Problem description


I read somewhere that if you choose a batch size that is a power of 2, training will be faster. What is the basis for this rule? Does it apply to other applications? Can you provide a reference paper?

Solution

Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.

However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so in theory you can take step-sizes that are n times larger, and a single step will take you roughly to the same accuracy as n steps of SGD with a mini-batch size of 1.
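To make the variance argument concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original answer; the "true" gradient value and noise level are arbitrary assumptions) that empirically checks how the variance of the mini-batch-averaged gradient shrinks as the batch size n grows:

```python
import numpy as np

# Sketch of the variance argument: draw noisy per-sample gradients around a
# fixed "true" gradient and measure the variance of the mini-batch average.
rng = np.random.default_rng(0)
true_grad = 3.0   # hypothetical true gradient (arbitrary value for illustration)
noise_std = 2.0   # per-sample gradient noise (arbitrary value for illustration)

def minibatch_gradient(n):
    """Average of n noisy per-sample gradient estimates."""
    samples = true_grad + noise_std * rng.standard_normal(n)
    return samples.mean()

for n in (1, 2, 4, 8, 16):
    estimates = np.array([minibatch_gradient(n) for _ in range(10_000)])
    # Empirically, the variance is close to noise_std**2 / n.
    print(f"batch size {n:2d}: variance ~ {estimates.var():.3f} "
          f"(theory: {noise_std**2 / n:.3f})")
```

Note that nothing in this sketch depends on n being a power of 2; the variance reduction is the same for any batch size.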

As for TensorFlow, I found no evidence supporting your claim, and it is a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132

Note that resizing images to a power of two makes sense (because pooling is generally done in 2x2 windows), but that's a different thing altogether.
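As a quick illustration of that last point (again a sketch added here, not part of the original answer), a power-of-two spatial size halves cleanly through a stack of 2x2 pooling layers, while an odd size gets truncated at each stage under 'valid' pooling:

```python
import tensorflow as tf

# Sketch: how 2x2 max-pooling interacts with the spatial size of the input.
# 64x64 halves cleanly (64 -> 32 -> 16 -> 8); 63x63 is truncated at every
# stage by 'valid' pooling (63 -> 31 -> 15 -> 7).
pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))

for size in (64, 63):
    x = tf.zeros([1, size, size, 3])  # dummy batch of one RGB image
    for _ in range(3):
        x = pool(x)
    print(f"input {size}x{size} -> after 3 poolings: {x.shape[1]}x{x.shape[2]}")
```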
