How is Nesterov's Accelerated Gradient Descent implemented in TensorFlow?


Question

The documentation for tf.train.MomentumOptimizer offers a use_nesterov parameter to utilise Nesterov's Accelerated Gradient (NAG) method.

However, NAG requires the gradient at a location other than that of the current variable to be calculated, and the apply_gradients interface only allows for the current gradient to be passed. So I don't quite understand how the NAG algorithm could be implemented with this interface.

The documentation says the following about the implementation:

use_nesterov: If True use Nesterov Momentum. See Sutskever et al., 2013. This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.

Having read through the paper in the link, I'm a little unsure about whether this description answers my question or not. How can the NAG algorithm be implemented when the interface doesn't require a gradient function to be provided?
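
For concreteness, here is a minimal TF 1.x-style sketch of the interface in question; the variable and the loss are toy placeholders, not part of the original question:

import tensorflow as tf  # TF 1.x API (tf.compat.v1.train in TF 2.x)

theta = tf.Variable([1.0, 2.0], name="theta")
loss = tf.reduce_sum(tf.square(theta))  # toy loss, for illustration only

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9, use_nesterov=True)
# compute_gradients evaluates the gradient at the current value of theta only,
# and apply_gradients accepts nothing but that gradient.
grads_and_vars = opt.compute_gradients(loss, var_list=[theta])
train_op = opt.apply_gradients(grads_and_vars)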

Answer

TL;DR

TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.

Details

This is a great question. In the paper, the NAG update is defined as

v_{t+1} = μ.v_t - λ.∇f(θ_t + μ.v_t)
θ_{t+1} = θ_t + v_{t+1}

where f is our cost function, θ_t our parameters at time t, μ the momentum, and λ the learning rate; v_t is NAG's internal accumulator.
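
Written out as a plain NumPy sketch (the quadratic cost and its gradient are made-up placeholders, not from the paper), the recursion is:

import numpy as np

def grad_f(theta):
    # toy quadratic cost f(theta) = 0.5 * ||theta||^2, so its gradient is theta
    return theta

mu, lam = 0.9, 0.1                 # momentum and learning rate
theta = np.array([1.0, 2.0])
v = np.zeros_like(theta)           # NAG's internal accumulator

for _ in range(10):
    v = mu * v - lam * grad_f(theta + mu * v)   # gradient at the look-ahead point
    theta = theta + v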

The main difference with standard momentum is the use of the gradient at θ_t + μ.v_t, not at θ_t. But as you said, TensorFlow only uses the gradient at θ_t. So what is the trick?

Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm is tracking θ_t + μ.v_t, not θ_t. The other part comes from an approximation valid for high values of momentum.

Let's make a slight change of notation from the paper for the accumulator, to stick with TensorFlow's definition. Define a_t = v_t / λ. The update rules then become

a_{t+1} = μ.a_t - ∇f(θ_t + μ.λ.a_t)
θ_{t+1} = θ_t + λ.a_{t+1}

(The motivation for this change in TF is that a is now a pure gradient momentum, independent of the learning rate. This makes the update robust to changes in λ, something common in practice that the paper does not consider.)
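
In the same NumPy sketch (same toy grad_f, still an assumption), the re-scaled recursion becomes:

import numpy as np

def grad_f(theta):
    return theta  # same toy quadratic as above

mu, lam = 0.9, 0.1
theta = np.array([1.0, 2.0])
a = np.zeros_like(theta)           # a_t = v_t / lambda

for _ in range(10):
    a = mu * a - grad_f(theta + mu * lam * a)   # accumulator no longer scales with lambda
    theta = theta + lam * a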

If we write ψ_t = θ_t + μ.λ.a_t, then

a_{t+1} = μ.a_t - ∇f(ψ_t)
ψ_{t+1} = θ_{t+1} + μ.λ.a_{t+1}
        = θ_t + λ.a_{t+1} + μ.λ.a_{t+1}
        = ψ_t + λ.a_{t+1} + μ.λ.(a_{t+1} - a_t)
        = ψ_t + λ.a_{t+1} + μ.λ.[(μ-1).a_t - ∇f(ψ_t)]
        ≈ ψ_t + λ.a_{t+1}

This last approximation holds for large values of momentum, where μ is close to 1 (so that μ-1 is close to zero) and ∇f(ψ_t) is small compared to a. The second assumption is actually more debatable, and it is less valid in directions where the gradient frequently changes sign.
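
To see the size of the dropped term, a small NumPy check (again with the toy quadratic, which is an assumption) can run the exact recursion for ψ alongside the approximate one and print the gap:

import numpy as np

def grad_f(x):
    return x  # toy quadratic, for illustration only

mu, lam = 0.9, 0.1
theta = np.array([1.0, 2.0])          # exact recursion, tracked through theta
a = np.zeros_like(theta)
psi_approx = np.array([1.0, 2.0])     # approximate recursion: standard momentum on psi
a_approx = np.zeros_like(psi_approx)

for _ in range(20):
    a = mu * a - grad_f(theta + mu * lam * a)      # gradient at psi_t = theta_t + mu.lam.a_t
    theta = theta + lam * a
    a_approx = mu * a_approx - grad_f(psi_approx)
    psi_approx = psi_approx + lam * a_approx

psi_exact = theta + mu * lam * a      # psi_t = theta_t + mu.lam.a_t by definition
print(np.max(np.abs(psi_exact - psi_approx)))      # gap due to the dropped mu.lam.(a_{t+1} - a_t) term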

We now have an update that uses the gradient at the current position, and the rules are pretty simple: they are in fact those of standard momentum.

However, we want θ_t, not ψ_t. This is why μ.λ.a_{t+1} is subtracted from ψ_{t+1} just before it is returned, and added back again as the first thing at the next call, to recover ψ.
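
Putting the last two paragraphs together, here is a NumPy sketch of that bookkeeping; grad_f and nesterov_step are illustrative names and assumptions, not TF's actual kernel:

import numpy as np

def grad_f(x):
    return x  # toy quadratic, for illustration only

mu, lam = 0.9, 0.1

def nesterov_step(theta, a):
    psi = theta + mu * lam * a              # add mu.lam.a_t back to recover psi_t
    a_new = mu * a - grad_f(psi)            # gradient taken at the stored value psi_t
    psi_new = psi + lam * a_new             # approximate update: standard momentum on psi
    theta_new = psi_new - mu * lam * a_new  # subtract mu.lam.a_{t+1} just before returning
    return theta_new, a_new

theta, a = np.array([1.0, 2.0]), np.zeros(2)
for _ in range(10):
    theta, a = nesterov_step(theta, a)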
