Tensorflow: Confusion regarding the adam optimizer
I'm confused as to how the Adam optimizer actually works in TensorFlow.
The way I read the docs, the learning rate is changed on every gradient descent iteration.
But when I call the function I give it a learning rate. And I don't call the function to, say, do one epoch (implicitly running however many iterations are needed to go through my training data). I call the function for each batch explicitly, like
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing. And I'm not passing a time variable in. Or is this some sort of generator-type thing where, after session creation, t is incremented each time I call the optimizer?
Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me that RMSProp is basically the same; the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
I find the documentation quite clear; I will paste the algorithm here in pseudo-code:
Your parameters:

- learning_rate: between 1e-4 and 1e-2 is standard
- beta1: 0.9 by default
- beta2: 0.999 by default
- epsilon: 1e-8 by default

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t will keep track of a moving average of the gradient and its square, for each parameter of the network. (So if you have 1M parameters, Adam will keep 2M more parameters in memory.)
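The memory claim is easy to see in code. A minimal sketch (NumPy, with a hypothetical 1M-entry parameter tensor of my own choosing) of the extra state Adam allocates:

```python
import numpy as np

# Hypothetical parameter tensor with 1M entries.
params = np.zeros((1000, 1000))

# Adam keeps one first-moment and one second-moment slot per parameter,
# so the optimizer state doubles the parameter memory.
m = np.zeros_like(params)  # 1st moment: moving average of gradients
v = np.zeros_like(params)  # 2nd moment: moving average of squared gradients

extra = m.size + v.size
print(extra)  # -> 2000000
```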
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
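The update above can be sketched in plain NumPy (a toy sketch, not TensorFlow's actual implementation; `adam_step` and the scalar example are mine). For a constant gradient, each step moves the variable by almost exactly learning_rate, which is why Adam's effective step size is roughly bounded by it:

```python
import numpy as np

def adam_step(variable, gradient, m, v, t,
              learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update, following the pseudo-code above (a sketch)."""
    t += 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t

# Three steps with a constant gradient of 2.0 on a scalar variable.
var, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(3):
    var, m, v, t = adam_step(var, np.array(2.0), m, v, t)

# The bias corrections cancel for a constant gradient, so each step
# is ~learning_rate: var is now close to 1.0 - 3 * 1e-3 = 0.997.
print(var)
```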
Here lr_t is a bit different from learning_rate, because for early iterations the moving averages have not converged yet, so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate.
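The bias-correction factor can be checked numerically (plain Python; the constants are the defaults quoted above):

```python
import math

learning_rate, beta1, beta2 = 1e-3, 0.9, 0.999

def lr_t(t):
    # Effective step size at iteration t, as in the pseudo-code above.
    return learning_rate * math.sqrt(1 - beta2**t) / (1 - beta1**t)

# The correction shrinks lr_t early on; once t is well past
# 1/(1-beta2) = 1000 steps, lr_t approaches learning_rate.
print(round(lr_t(1), 7))      # 0.0003162 -> well below learning_rate
print(round(lr_t(1000), 7))   # 0.0007952 -> around t = 1/(1-beta2)
print(round(lr_t(10000), 7))  # 0.001     -> almost learning_rate
```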
To answer your question, you just need to pass a fixed learning rate, keep the default values of beta1 and beta2, maybe modify epsilon, and Adam will do the magic :)
Link with RMSProp

Adam with beta1=0 is equivalent to RMSProp with momentum=0 (up to Adam's bias correction of v_t and where epsilon is added). The argument beta2 of Adam and the argument decay of RMSProp are the same.
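A quick numeric check (plain Python; the constant gradient and step count are illustrative choices of mine): setting beta1 = 0 makes Adam's m_t equal the raw gradient, and for large t its step size matches RMSProp's with momentum = 0:

```python
import math

g = 0.5                       # a constant gradient, for illustration
lr, beta2, eps = 1e-3, 0.999, 1e-8

# Run both second-moment accumulators for many steps so that
# Adam's bias-correction factor sqrt(1 - beta2^t) is ~1.
v_adam = v_rms = 0.0
for t in range(1, 10001):
    v_adam = beta2 * v_adam + (1 - beta2) * g**2
    v_rms = beta2 * v_rms + (1 - beta2) * g**2  # RMSProp decay == beta2

t = 10000
# Adam with beta1 = 0: m_t is just the gradient.
adam_step = lr * math.sqrt(1 - beta2**t) * g / (math.sqrt(v_adam) + eps)
# RMSProp with momentum = 0.
rms_step = lr * g / math.sqrt(v_rms + eps)

# The two step sizes agree to several decimal places once t is large;
# the tiny residual comes from bias correction and epsilon placement.
print(adam_step, rms_step)
```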
However, RMSProp does not keep a moving average of the gradient, but it can maintain momentum, like MomentumOptimizer.
A detailed description of RMSProp:
- maintain a moving (discounted) average of the square of gradients
- divide gradient by the root of this average
- (can maintain a momentum)
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom_t <- momentum * mom_{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom_t
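That pseudo-code can be sketched in NumPy as well (my own toy implementation, with TF-like default hyperparameters assumed, not TF's actual code). With a constant positive gradient the momentum term accumulates, so the variable moves further on each step:

```python
import numpy as np

def rmsprop_step(variable, gradient, v, mom,
                 learning_rate=1e-3, decay=0.9, momentum=0.9, epsilon=1e-10):
    """One RMSProp-with-momentum update, following the pseudo-code above."""
    v = decay * v + (1 - decay) * gradient**2
    mom = momentum * mom + learning_rate * gradient / np.sqrt(v + epsilon)
    variable = variable - mom
    return variable, v, mom

# Five steps with a constant gradient of 2.0 on a scalar variable.
var, v, mom = 1.0, 0.0, 0.0
for _ in range(5):
    var, v, mom = rmsprop_step(var, 2.0, v, mom)

# mom keeps growing for a constant gradient, so the per-step
# displacement increases; var has dropped from 1.0 to ~0.969.
print(var)
```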