Neural Network learning rate and batch weight update

Question

I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.

I've read that batch updates of the weights give a more stable gradient search than online weight updates.

As a test I've created a time-series function of 100 points, such that x = [0..99] and y = f(x). For testing I've created a Neural Network with one input, one output, and two hidden layers of 10 neurons each. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.

I have 100 input points so when I calculate the weight change dw_{ij} for each node it is actually a sum:

dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}

where in this case p = 100.
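
For concreteness, here is a minimal Java sketch of the batch accumulation described above: the per-pattern contributions dw_{ij,k} are summed over all p = 100 patterns before a single weight update is applied. All names and the toy gradient function are illustrative placeholders, not code from the question.

```java
// Minimal sketch of batch weight accumulation for one weight matrix.
// All names and the toy per-pattern gradient are illustrative placeholders.
public class BatchUpdateSketch {

    // Toy per-pattern gradient: in a real network this would come from
    // back-propagating the error of pattern k through the layer.
    static double gradientForPattern(int i, int j, int k) {
        return Math.sin(0.01 * k * (i + 1)) * 0.001 * (j + 1);
    }

    public static void main(String[] args) {
        int p = 100;                          // number of training patterns, as in the question
        int rows = 10, cols = 10;             // one 10x10 weight matrix between hidden layers
        double[][] weights = new double[rows][cols];
        double learningRate = 0.7 / (p * p);  // the question's empirical choice

        // dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}
        double[][] dw = new double[rows][cols];
        for (int k = 0; k < p; k++) {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    dw[i][j] += gradientForPattern(i, j, k);
                }
            }
        }

        // One batch step: the summed change grows with p, which is why the
        // raw update becomes huge unless the learning rate is scaled down.
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                weights[i][j] -= learningRate * dw[i][j];
            }
        }
        System.out.println("example updated weight: " + weights[0][0]);
    }
}
```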

Now the weight updates become really huge and therefore my error E bounces around such that it is hard to find a minimum. The only way I got some proper behaviour was when I set the learning rate y to something like 0.7 / p^2.

Is there some general rule for setting the learning rate, based on the amount of samples?
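
One widely used convention (my own note, not the FAQ answer quoted below) is to divide the accumulated change by p, i.e. apply the mean rather than the sum of the per-pattern gradients, so that the size of a batch step stays on the scale of a single-pattern step and the learning rate does not have to shrink as p grows. A toy sketch:

```java
// Sketch: applying the batch update with the mean gradient instead of the sum.
// Dividing by the number of patterns p keeps a single step comparable in size
// to an online step, so the learning rate need not be rescaled as p grows.
// All names and numbers are illustrative only.
public class MeanGradientSketch {
    public static void main(String[] args) {
        int p = 100;
        double learningRate = 0.01;   // a plain per-pattern rate
        double summedGradient = 2.5;  // stands in for dw_{ij,1} + ... + dw_{ij,p}
        double weight = 0.3;

        // sum form: the step grows with p when the per-pattern gradients agree
        double stepFromSum = learningRate * summedGradient;
        // mean form: the step stays on the scale of one pattern's gradient
        double stepFromMean = learningRate * (summedGradient / p);

        System.out.println("step using summed gradient: " + stepFromSum);
        System.out.println("step using mean gradient:   " + stepFromMean);
        System.out.println("updated weight (mean form): " + (weight - stepFromMean));
    }
}
```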

Answer

http://francky.me/faqai.php#otherFAQs:

Subject: What learning rate should be used for backprop?

In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and objective function diverge, so there is no learning at all. If the objective function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs with hidden units, the optimal learning rate often changes dramatically during the training process, since the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a tedious process requiring much trial and error. For some examples of how the choice of learning rate and momentum interact with numerical condition in some very simple networks, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html
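
To illustrate the Hessian remark for the simplest possible case, the sketch below runs gradient descent on a one-dimensional quadratic E(w) = ½λw², where λ plays the role of the Hessian; the iteration contracts only when 0 < η < 2/λ. This is a standard textbook fact, shown here with made-up numbers rather than anything taken from the FAQ.

```java
// Sketch: why the Hessian bounds the usable learning rate for a quadratic
// objective. For E(w) = 0.5 * lambda * w^2 (lambda = curvature, the 1-D
// "Hessian"), each gradient-descent step multiplies w by (1 - eta * lambda),
// so the iteration converges only if 0 < eta < 2 / lambda.
// Purely illustrative; not code from the FAQ.
public class QuadraticRateSketch {
    static double runGradientDescent(double lambda, double eta, int steps) {
        double w = 1.0;                    // start away from the minimum at 0
        for (int t = 0; t < steps; t++) {
            double gradient = lambda * w;  // dE/dw for E = 0.5*lambda*w^2
            w -= eta * gradient;
        }
        return w;
    }

    public static void main(String[] args) {
        double lambda = 4.0;               // curvature, so the bound is 2/lambda = 0.5
        System.out.println("eta=0.40 -> w after 50 steps: " + runGradientDescent(lambda, 0.40, 50));
        System.out.println("eta=0.49 -> w after 50 steps: " + runGradientDescent(lambda, 0.49, 50));
        System.out.println("eta=0.60 -> w after 50 steps: " + runGradientDescent(lambda, 0.60, 50)); // diverges
    }
}
```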

With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist (see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned under "What are conjugate gradients, Levenberg-Marquardt, etc.?").

Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
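
To make the "step size should not track gradient magnitude" point concrete, here is a hedged sketch of the sign-based step adaptation at the heart of RPROP (a simplified variant without weight backtracking, using the commonly quoted defaults η+ = 1.2, η− = 0.5); it is an illustration, not code from the FAQ or from the original RPROP paper.

```java
// Sketch of a simplified RPROP update for a single weight: the step size
// delta is adapted from the *sign* of successive batch gradients, and the
// weight moves by -sign(gradient) * delta, so the gradient's magnitude
// never enters the step. Constants are commonly quoted defaults.
public class RpropSketch {
    static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;
    static final double DELTA_MIN = 1e-6, DELTA_MAX = 50.0;

    double weight = 0.5;
    double delta = 0.1;            // current step size for this weight
    double previousGradient = 0.0;

    void update(double gradient) {
        double signProduct = gradient * previousGradient;
        if (signProduct > 0) {
            // same direction as last batch: grow the step
            delta = Math.min(delta * ETA_PLUS, DELTA_MAX);
        } else if (signProduct < 0) {
            // sign flipped (a minimum was overshot): shrink the step and,
            // by zeroing the stored gradient, skip adaptation next time
            delta = Math.max(delta * ETA_MINUS, DELTA_MIN);
            gradient = 0.0;
        }
        weight -= Math.signum(gradient) * delta;
        previousGradient = gradient;
    }

    public static void main(String[] args) {
        RpropSketch w = new RpropSketch();
        double[] batchGradients = {0.8, 0.6, 0.7, -0.3, -0.2, 0.1}; // toy sequence
        for (double g : batchGradients) {
            w.update(g);
            System.out.printf("weight=%.4f  delta=%.4f%n", w.weight, w.delta);
        }
    }
}
```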

With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the learning rate during training. Various proposals have appeared in the NN literature, but most of them don't work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate. There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running average of the weight values. I have no personal experience with these methods; if you have any solid evidence that these or other methods of automatically setting the learning rate and/or momentum in incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer (saswss@unx.sas.com).
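
The "iterate averaging" idea is simple enough to sketch: train with ordinary incremental updates, but also keep a running average of the weight iterates and report the average as the final estimate. The toy one-dimensional example below is my own illustration of that idea, not something from the FAQ.

```java
// Sketch of Polyak ("iterate") averaging: perform ordinary noisy incremental
// updates, but keep a running average of the weight iterates and use the
// average as the final parameter estimate. Toy 1-D problem, illustrative only.
public class PolyakAveragingSketch {
    public static void main(String[] args) {
        double w = 0.0;           // current weight
        double wAverage = 0.0;    // running average of all weight iterates
        double learningRate = 0.1;
        java.util.Random rng = new java.util.Random(42);

        for (int t = 1; t <= 1000; t++) {
            // noisy gradient of E(w) = 0.5*(w - 3)^2, as in stochastic training
            double noisyGradient = (w - 3.0) + rng.nextGaussian();
            w -= learningRate * noisyGradient;
            // incremental running mean: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
            wAverage += (w - wAverage) / t;
        }
        System.out.println("last iterate:     " + w);         // still noisy
        System.out.println("averaged iterate: " + wAverage);  // typically much closer to the optimum 3.0
    }
}
```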

References:

  • Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
  • Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., eds., Advances in Neural Information Processing Systems 4, San Mateo, CA: Morgan Kaufmann Publishers, pp. 1009-1016.
  • Kushner, H.J. and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.
  • LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), "Automatic learning rate maximization by online estimation of the Hessian's eigenvectors," in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 156-163.
  • Orr, G.B. and Leen, T.K. (1997), "Using curvature information for fast stochastic search," in Mozer, M.C., Jordan, M.I., and Petsche, T. (eds.), Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 606-612.

Credits:

  • Archive-name: ai-faq/neural-nets/part1
  • Last-modified: 2002-05-17
  • URL: ftp://ftp.sas.com/pub/neural/FAQ.html
  • Maintainer: saswss@unx.sas.com (Warren S. Sarle)
  • Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC, USA.
