Learning rate doesn't change for AdamOptimizer in TensorFlow


Problem description


I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).

Here is a code snippet from what I have so far:

import tensorflow as tf

# loss, global_step and sess are defined elsewhere in the graph/session setup
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

sess.run(tf.initialize_all_variables())

for i in range(0, 10000):
    sess.run(train_op)
    print(sess.run(optimizer._lr_t))

If I run the code, I constantly get the initial learning rate (1e-3), i.e. I see no change.

What is the correct way to get the learning rate at every step?

I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.

Solution

I was asking myself the exact same question, and wondering why it wouldn't change. By looking at the original paper (page 2), one sees that the self._lr stepsize (denoted by alpha in the paper) is required by the algorithm, but never updated. We also see that there is an alpha_t that is updated for every step t, and that should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.

So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:

alpha_t = alpha * sqrt(1-beta_2_t) / (1-beta_1_t)
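
Just to get a feel for the numbers: with the default hyperparameters (alpha = 1e-3, beta_1 = 0.9, beta_2 = 0.999, my assumption here), the effective step size at the very first step works out to roughly a third of alpha:

# Back-of-the-envelope check of alpha_t with Adam's default hyperparameters.
alpha, beta1, beta2 = 1e-3, 0.9, 0.999

t = 1  # first training step
alpha_t = alpha * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)
print(alpha_t)  # ~3.16e-4, noticeably smaller than alpha itself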

BACKGROUND

As you observed, the _lr_t tensor doesn't change throughout the training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].

By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
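
If you want to reproduce that inspection, a minimal sketch (assuming the optimizer, train_op and sess from the question, and that these private attributes exist in your TF version) would be:

# Sketch: watch the optimizer's internal variables over a few training steps.
# _lr_t stays constant, while _beta1_power and _beta2_power decay geometrically.
print(sorted(optimizer.__dict__.keys()))

for step in range(5):
    sess.run(train_op)
    print(step,
          sess.run(optimizer._lr_t),         # stays at the initial value
          sess.run(optimizer._beta1_power),  # shrinks by a factor of beta1 each step
          sess.run(optimizer._beta2_power))  # shrinks by a factor of beta2 each step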

Further inspecting the optimizer's code, in line 211, we see the following update:

update_beta1 = self._beta1_power.assign(
        self._beta1_power * self._beta1_t,
        use_locking=self._use_locking)

This basically means that _beta1_power, which is initialized with _beta1, gets multiplied by _beta1_t after every iteration (and _beta1_t is itself also initialized with _beta1).

But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1 and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and _lr_t do. I guess this is for a reason, but I personally don't know why; in any case, these are protected/private attributes of the implementation (as they start with an underscore) and don't belong to the public interface (they may even change among TF versions).

So after this small background we can see that _beta1_power and _beta2_power are the original beta values exponentiated to the current training step, that is, the equivalent of the variables referred to as beta_1^t and beta_2^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
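
As a quick plain-Python illustration of that claim (just replaying the assign from line 211 above by hand, no TensorFlow needed):

# Replaying the assign from adam.py by hand: the value used at training
# step t is beta1 ** t (the accumulator is bumped at the end of each step).
beta1 = 0.9
beta1_power = beta1                      # initialized with _beta1
for t in range(1, 6):
    print(t, beta1_power, beta1 ** t)    # columns agree (up to float rounding)
    beta1_power *= beta1                 # end-of-step update: _beta1_power * _beta1_t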

SOLUTION

optimizer = tf.train.AdamOptimizer()
# rest of the graph...

# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0 * (1 - bb2)**0.5 / (1 - bb1)
print(at)

The variable at holds the alpha_t for the current training step.
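
Since your original goal was to visualize this in TensorBoard, you can also build alpha_t as a tensor in the graph and attach a scalar summary to it. Again this leans on the same private attributes, so treat it as a sketch rather than an official interface (the log directory below is just an example):

# Sketch: expose alpha_t as a graph tensor and log it as a scalar summary.
alpha_t = (optimizer._lr_t
           * tf.sqrt(1.0 - optimizer._beta2_power)
           / (1.0 - optimizer._beta1_power))
tf.summary.scalar('alpha_t', alpha_t)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/adam_lr', sess.graph)  # example log dir

for step in range(10000):
    _, summ = sess.run([train_op, merged])
    writer.add_summary(summ, step)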

DISCLAIMER

I couldn't find a cleaner way of getting this value by just using the optimizer's interface, but please let me know if one exists! I guess there is none, which actually calls into question the usefulness of plotting alpha_t, since it does not depend on the data.

Also, to complete this information, section 2 of the paper also gives the formula for the weight updates, which is much more telling, but also more plot-intensive. For a very nice implementation of that, you may want to take a look at this answer from the post that you linked.
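
For reference, here is how I would sketch that section-2 update in plain NumPy (my own reading of the paper, not TensorFlow's fused kernel; grad_fn is just a stand-in for whatever computes your gradient):

import numpy as np

def adam_step(theta, m, v, t, grad_fn, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector `theta`, following section 2 of the paper."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g           # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction, cf. _beta1_power
    v_hat = v / (1 - beta2 ** t)              # bias correction, cf. _beta2_power
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v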

Hope it helps! Cheers,
Andres
