Selection of Mini-batch Size for Neural Network Regression

Question

I am doing a neural network regression with 4 features. How do I determine the mini-batch size for my problem? I see people use batch sizes of 100–1000 for computer vision, where each image has 32*32*3 features; does that mean I should use a batch size of 1 million? I have billions of data points and tens of GB of memory, so there is no hard constraint preventing me from doing that.

I have also observed that a mini-batch size of ~1000 makes convergence much faster than a batch size of 1 million. I thought it should be the other way around, since the gradient computed on a larger batch is more representative of the gradient over the whole sample. Why does using a smaller mini-batch make convergence faster?
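
One way to approach this empirically is to train the same model with several batch sizes and compare the resulting learning curves. Below is a minimal sketch using Keras; the data, architecture, and hyperparameters are placeholders standing in for the 4-feature regression described above, not a prescription:

import numpy as np
from tensorflow import keras

# Placeholder data: substitute your own 4-feature regression set.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 4)).astype("float32")
y = X @ np.array([1.5, -2.0, 0.7, 3.0], dtype="float32") \
    + 0.1 * rng.standard_normal(100_000).astype("float32")

def make_model():
    return keras.Sequential([
        keras.Input(shape=(4,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),
    ])

results = {}
for batch_size in (32, 1_000, 100_000):
    model = make_model()
    model.compile(optimizer="adam", loss="mse")
    # Same number of epochs per run; larger batches mean far fewer parameter updates.
    history = model.fit(X, y, batch_size=batch_size, epochs=10,
                        validation_split=0.1, verbose=0)
    results[batch_size] = history.history["val_loss"][-1]

for bs, val_mse in results.items():
    print(f"batch_size={bs:>7}: final validation MSE = {val_mse:.4f}")

With a fixed epoch budget, the very large batch performs far fewer updates per epoch, which is one of the reasons the observed convergence (per epoch or per wall-clock second) is typically slower.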

Answer

From "Tradeoff batch size vs. number of iterations to train a neural network":

From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :

The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.

[…]

The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.

[…]
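
For intuition about "sharp" versus "flat" minimizers, the curvature at a solution can be probed through the largest eigenvalue of the Hessian $\nabla^2 f(x)$, estimated with Hessian-vector products and power iteration rather than by forming the Hessian explicitly. The sketch below uses PyTorch with a throwaway model and random data purely for illustration; it is not the procedure used in the paper:

import torch

def top_hessian_eigenvalue(loss, params, iters=50):
    # Estimate the largest-magnitude eigenvalue of the Hessian of `loss`
    # w.r.t. `params` by power iteration on Hessian-vector products.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v /= v.norm()
    estimate = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        estimate = (v @ hv).item()   # Rayleigh quotient v^T H v (v is unit-norm)
        v = hv / (hv.norm() + 1e-12)
    return estimate

# Throwaway model and data, standing in for a trained regression network.
model = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
X, y = torch.randn(1024, 4), torch.randn(1024, 1)
loss = torch.nn.functional.mse_loss(model(X), y)
params = [p for p in model.parameters() if p.requires_grad]
print("approx. top Hessian eigenvalue:", top_hessian_eigenvalue(loss, params))

A large estimate indicates a sharp minimizer in the sense used in the quoted passage; a small one indicates a flat minimizer.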

Also, some good insights from Ian Goodfellow answering "Why not use the whole training set to compute the gradient?" on Quora:

The size of the learning rate is limited mostly by factors like how curved the cost function is. You can think of gradient descent as making a linear approximation to the cost function, then moving downhill along that approximate cost. If the cost function is highly non-linear (highly curved) then the approximation will not be very good for very far, so only small step sizes are safe. You can read more about this in Chapter 4 of the deep learning textbook, on numerical computation: http://www.deeplearningbook.org/contents/numerical.html
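
To make the curvature argument concrete, consider the second-order Taylor expansion of the cost around $x$ for a gradient step of size $\eta$ (a standard sketch, not part of the quoted answer):

$$f\big(x - \eta \nabla f(x)\big) \approx f(x) - \eta\, \|\nabla f(x)\|^2 + \frac{\eta^2}{2}\, \nabla f(x)^\top \nabla^2 f(x)\, \nabla f(x).$$

Within this quadratic model the step reduces the cost only while $\eta < 2\,\|\nabla f\|^2 / \big(\nabla f^\top \nabla^2 f\, \nabla f\big)$, a threshold that is never smaller than $2/\lambda_{\max}(\nabla^2 f)$. Sharper curvature (larger $\lambda_{\max}$) therefore forces a smaller safe learning rate.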

When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html
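
To put numbers on this trade-off, using the scaling implied above: if per-example gradients have componentwise standard deviation $\sigma$, the average over a mini-batch of $m$ independent examples has standard error

$$\mathrm{SE}(\hat{g}_m) = \frac{\sigma}{\sqrt{m}}, \qquad \frac{\mathrm{SE}(\hat{g}_{10^3})}{\mathrm{SE}(\hat{g}_{10^6})} = \sqrt{\frac{10^6}{10^3}} \approx 31.6,$$

so moving from a batch of 1,000 to the 1 million mentioned in the question costs roughly 1000x more computation per update while shrinking gradient noise by only about a factor of 32.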

Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
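
Stated symbolically (with $p_{\text{data}}$ the data-generating distribution, $\ell$ the per-example loss, and $\theta$ the parameters; this restatement is not part of the quoted answer):

$$g_{\text{true}} = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[\nabla_\theta\, \ell(\theta; x, y)\big], \qquad \hat{g}_m = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta\, \ell(\theta; x_i, y_i).$$

Both a mini-batch ($m \ll N$) and the full training set ($m = N$) give only an estimate $\hat{g}_m$ of $g_{\text{true}}$; the full batch is simply the lowest-variance estimate the collected data allows, not the true gradient.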
