How is Nesterov's Accelerated Gradient Descent implemented in TensorFlow?


Question

The documentation for tf.train.MomentumOptimizer offers a use_nesterov parameter to utilise Nesterov's Accelerated Gradient (NAG) method.

However, NAG requires the gradient at a location other than that of the current variable to be calculated, and the apply_gradients interface only allows for the current gradient to be passed. So I don't quite understand how the NAG algorithm could be implemented with this interface.

The documentation says the following about the implementation:

use_nesterov: If True use Nesterov Momentum. See Sutskever et al., 2013. This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.

Having read through the paper in the link, I'm a little unsure about whether this description answers my question or not. How can the NAG algorithm be implemented when the interface doesn't require a gradient function to be provided?
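
For concreteness, here is a minimal TF 1.x-style sketch of the interface in question; the variable and the loss are toy placeholders, not part of the original question:

import tensorflow as tf  # TF 1.x API (tf.compat.v1.train in TF 2.x)

theta = tf.Variable([1.0, 2.0], name="theta")
loss = tf.reduce_sum(tf.square(theta))  # toy loss, for illustration only

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9, use_nesterov=True)
# compute_gradients evaluates the gradient at the current value of theta only,
# and apply_gradients accepts nothing but that gradient.
grads_and_vars = opt.compute_gradients(loss, var_list=[theta])
train_op = opt.apply_gradients(grads_and_vars)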

Answer

TL;DR

TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.

Details

This is a great question. In the paper, the NAG update is defined as

v_{t+1} = μ.v_t - λ.∇f(θ_t + μ.v_t)
θ_{t+1} = θ_t + v_{t+1}

where f is our cost function, θ_t our parameters at time t, μ the momentum, and λ the learning rate; v_t is NAG's internal accumulator.
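
Written out as a plain NumPy sketch (the quadratic cost and its gradient are made-up placeholders, not from the paper), the recursion is:

import numpy as np

def grad_f(theta):
    # toy quadratic cost f(theta) = 0.5 * ||theta||^2, so its gradient is theta
    return theta

mu, lam = 0.9, 0.1                 # momentum and learning rate
theta = np.array([1.0, 2.0])
v = np.zeros_like(theta)           # NAG's internal accumulator

for _ in range(10):
    v = mu * v - lam * grad_f(theta + mu * v)   # gradient at the look-ahead point
    theta = theta + v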

The main difference with standard momentum is the use of the gradient at θ_t + μ.v_t, not at θ_t. But as you said, TensorFlow only uses the gradient at θ_t. So what is the trick?

Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm is tracking θ_t + μ.v_t, not θ_t. The other part comes from an approximation valid for high values of momentum.

Let's make a slight change of notation from the paper for the accumulator, to stick with TensorFlow's definition. Define a_t = v_t / λ. The update rules then become

a_{t+1} = μ.a_t - ∇f(θ_t + μ.λ.a_t)
θ_{t+1} = θ_t + λ.a_{t+1}

(The motivation for this change in TF is that a is now a pure gradient momentum, independent of the learning rate. This makes the update robust to changes in λ, something common in practice that the paper does not consider.)
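
In the same NumPy sketch (same toy grad_f, still an assumption), the re-scaled recursion becomes:

import numpy as np

def grad_f(theta):
    return theta  # same toy quadratic as above

mu, lam = 0.9, 0.1
theta = np.array([1.0, 2.0])
a = np.zeros_like(theta)           # a_t = v_t / lambda

for _ in range(10):
    a = mu * a - grad_f(theta + mu * lam * a)   # accumulator no longer scales with lambda
    theta = theta + lam * a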

If we write ψ_t = θ_t + μ.λ.a_t, then

a_{t+1} = μ.a_t - ∇f(ψ_t)
ψ_{t+1} = θ_{t+1} + μ.λ.a_{t+1}
        = θ_t + λ.a_{t+1} + μ.λ.a_{t+1}
        = ψ_t + λ.a_{t+1} + μ.λ.(a_{t+1} - a_t)
        = ψ_t + λ.a_{t+1} + μ.λ.[(μ-1).a_t - ∇f(ψ_t)]
        ≈ ψ_t + λ.a_{t+1}

This last approximation holds for large values of momentum, where μ is close to 1 (so that μ-1 is close to zero) and ∇f(ψ_t) is small compared to a. The second assumption is actually more debatable, and it is less valid in directions where the gradient frequently changes sign.
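
To see the size of the dropped term, a small NumPy check (again with the toy quadratic, which is an assumption) can run the exact recursion for ψ alongside the approximate one and print the gap:

import numpy as np

def grad_f(x):
    return x  # toy quadratic, for illustration only

mu, lam = 0.9, 0.1
theta = np.array([1.0, 2.0])          # exact recursion, tracked through theta
a = np.zeros_like(theta)
psi_approx = np.array([1.0, 2.0])     # approximate recursion: standard momentum on psi
a_approx = np.zeros_like(psi_approx)

for _ in range(20):
    a = mu * a - grad_f(theta + mu * lam * a)      # gradient at psi_t = theta_t + mu.lam.a_t
    theta = theta + lam * a
    a_approx = mu * a_approx - grad_f(psi_approx)
    psi_approx = psi_approx + lam * a_approx

psi_exact = theta + mu * lam * a      # psi_t = theta_t + mu.lam.a_t by definition
print(np.max(np.abs(psi_exact - psi_approx)))      # gap due to the dropped mu.lam.(a_{t+1} - a_t) term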

We now have an update that uses the gradient at the current position, and the rules are pretty simple: they are in fact those of standard momentum.

However, we want θ_t, not ψ_t. This is why μ.λ.a_{t+1} is subtracted from ψ_{t+1} just before it is returned, and added back again as the first thing at the next call, to recover ψ.
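
Putting the last two paragraphs together, here is a NumPy sketch of that bookkeeping; grad_f and nesterov_step are illustrative names and assumptions, not TF's actual kernel:

import numpy as np

def grad_f(x):
    return x  # toy quadratic, for illustration only

mu, lam = 0.9, 0.1

def nesterov_step(theta, a):
    psi = theta + mu * lam * a              # add mu.lam.a_t back to recover psi_t
    a_new = mu * a - grad_f(psi)            # gradient taken at the stored value psi_t
    psi_new = psi + lam * a_new             # approximate update: standard momentum on psi
    theta_new = psi_new - mu * lam * a_new  # subtract mu.lam.a_{t+1} just before returning
    return theta_new, a_new

theta, a = np.array([1.0, 2.0]), np.zeros(2)
for _ in range(10):
    theta, a = nesterov_step(theta, a)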
