Tensorflow: Confusion regarding the adam optimizer


Problem description


I'm confused as to how the Adam optimizer actually works in TensorFlow.

The way I read the docs, the learning rate is changed on every gradient descent iteration.

But when I call the function I give it a learning rate. And I don't call the function to, say, do one epoch (implicitly running however many iterations are needed to go through my training data). I call the function explicitly for each batch, like

for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})

So my eta cannot be changing. And I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?

Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me that RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?

Solution

I find the documentation quite clear; I will paste the algorithm here in pseudo-code:

Your parameters:

  • learning_rate: between 1e-4 and 1e-2 is standard
  • beta1: 0.9 by default
  • beta2: 0.999 by default
  • epsilon: 1e-08 by default

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.


Initialization:

m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)

m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M more parameters in memory.)


At each iteration t, and for each parameter of the model:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
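
As a concrete illustration (not TensorFlow's actual implementation), here is a minimal NumPy sketch of that update; the function name adam_step and the toy inputs are made up for the example:

import numpy as np

def adam_step(variable, gradient, m, v, t,
              learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # one Adam update for a single parameter tensor, mirroring the pseudo-code
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * gradient        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2   # moving average of the squared gradient
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# m, v and t start at zero, as in the initialization above
variable = np.array([1.0, -2.0])
m, v, t = np.zeros_like(variable), np.zeros_like(variable), 0
variable, m, v, t = adam_step(variable, np.array([0.5, 0.1]), m, v, t)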


Here lr_t is a bit different from learning_rate because, for early iterations, the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
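
For instance, a quick check of that correction factor with the default beta1=0.9 and beta2=0.999 (a throwaway snippet, not part of TensorFlow):

import numpy as np

beta1, beta2 = 0.9, 0.999
for t in [1, 10, 100, 1000, 10000]:
    print(t, np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t))
# t=1     -> ~0.32  (effective step is scaled down early on)
# t=1000  -> ~0.80
# t=10000 -> ~1.00  (lr_t ~= learning_rate once t >> 1/(1-beta2) = 1000)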


To answer your question, you just need to pass a fixed learning rate, keep the default values of beta1 and beta2, maybe modify epsilon, and Adam will do the magic :)
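
In TensorFlow 1.x terms (matching the sess.run style in the question), that just means constructing the optimizer once with a constant learning rate. A minimal sketch, assuming a hypothetical toy loss tensor in place of your real model:

import tensorflow as tf

# hypothetical toy graph, only to show the optimizer call; replace with your model
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# fixed learning_rate; the m_t / v_t state and the lr_t bias correction
# are kept and applied internally by the optimizer
train_adam_step = tf.train.AdamOptimizer(
    learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8).minimize(loss)

You can then run train_adam_step once per batch without feeding any eta placeholder at all.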


Link with RMSProp

Adam with beta1=1 is equivalent to RMSProp with momentum=0. The argument beta2 of Adam and the argument decay of RMSProp are the same.

However, RMSProp does not keep a moving average of the gradient, but it can maintain a momentum term, like MomentumOptimizer.

A detailed description of rmsprop:

  • maintain a moving (discounted) average of the square of gradients
  • divide gradient by the root of this average
  • (can maintain a momentum)

Here is the pseudo-code:

v_t <- decay * v_{t-1} + (1 - decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
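
Translated to the same TensorFlow 1.x style, and following the mapping above (decay in place of beta2, momentum set to 0), a sketch might look like this; loss is again assumed to be an existing loss tensor in your graph, such as the toy one from the Adam sketch:

import tensorflow as tf

# decay plays the role of beta2 and momentum is 0, per the mapping above;
# `loss` is assumed to be defined elsewhere in your graph
train_rmsprop_step = tf.train.RMSPropOptimizer(
    learning_rate=1e-3, decay=0.999, momentum=0.0, epsilon=1e-8).minimize(loss)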
