Adam optimizer goes haywire after 200k batches, training loss grows


Problem description

I've been seeing a very strange behavior when training a network, where after a couple of 100k iterations (8 to 10 hours) of learning fine, everything breaks and the training loss grows:

The training data itself is randomized and spread across many .tfrecord files containing 1000 examples each, then shuffled again in the input stage and batched to 200 examples.
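
For reference, a minimal sketch of what such an input stage can look like with the TF 1.x queue-based readers; the file list, feature keys and shapes below are assumptions for illustration, not the asker's actual code:

import tensorflow as tf

def input_pipeline(tfrecord_files, batch_size=200):
    # Queue of input file names; shuffling here randomizes the file order.
    filename_queue = tf.train.string_input_producer(tfrecord_files, shuffle=True)

    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)

    # Hypothetical feature spec; the real keys and shapes depend on the dataset.
    features = tf.parse_single_example(serialized, features={
        'image': tf.FixedLenFeature([64 * 64 * 3], tf.float32),
        'targets': tf.FixedLenFeature([6], tf.float32),
    })

    # Shuffle again across files and assemble batches of 200 examples.
    # (Requires tf.train.start_queue_runners() once a session is running.)
    return tf.train.shuffle_batch(
        [features['image'], features['targets']],
        batch_size=batch_size,
        capacity=10000,
        min_after_dequeue=5000)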

I am designing a network that performs four different regression tasks at the same time, e.g. determining the likelihood of an object appearing in the image and simultaneously determining its orientation. The network starts with a couple of convolutional layers, some with residual connections, and then branches into the four fully-connected segments.
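
A rough sketch of that kind of shared trunk with four heads, written against the TF 1.x tf.layers API; the layer counts, widths and head names are invented for illustration, since the question does not show the actual architecture:

import tensorflow as tf

def build_network(images):
    # Shared convolutional trunk with one residual connection.
    x = tf.layers.conv2d(images, 32, 3, padding='same', activation=tf.nn.relu)
    shortcut = x
    x = tf.layers.conv2d(x, 32, 3, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 32, 3, padding='same', activation=None)
    x = tf.nn.relu(x + shortcut)  # residual connection

    flat = tf.layers.flatten(x)

    # Four independent fully-connected heads, one per regression task.
    logit_probability = tf.layers.dense(flat, 1)  # object presence (logit)
    sin_cos = tf.layers.dense(flat, 2)            # orientation as sin/cos
    magnitude = tf.layers.dense(flat, 1)          # further regression target
    position = tf.layers.dense(flat, 2)           # further regression target
    return logit_probability, sin_cos, magnitude, position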

Since the first regression results in a probability, I'm using cross entropy for the loss, whereas the others use the classical L2 distance. However, due to their nature, the probability loss is on the order of 0..1, while the orientation losses can be much larger, say 0..10. I already normalized both input and output values and use clipping

# Clip the inferred sin/cos values by their average L2 norm
normalized = tf.clip_by_average_norm(inferred.sin_cos, clip_norm=2.)

in cases where things can get really bad.
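
A minimal sketch of how such a mixed loss setup might be written; the target_* placeholders and head tensors are assumptions, and only the loss names match the question's own code further down:

# Probability head: sigmoid cross entropy against a 0/1 target.
loss_probability = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=logit_probability,
                                            labels=target_probability))

# Remaining heads: classical L2 / mean squared error.
sin_cos_mse = tf.reduce_mean(tf.square(normalized - target_sin_cos))
magnitude_mse = tf.reduce_mean(tf.square(magnitude - target_magnitude))
pos_mse = tf.reduce_mean(tf.square(position - target_position))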

I've been (successfully) using the Adam optimizer to optimize on the tensor containing all distinct losses (rather than reduce_suming them), like so:

# Sum of all regularization terms collected in the graph
reg_loss = tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
# Stack the per-task losses into one tensor (tf.pack was later renamed tf.stack)
loss = tf.pack([loss_probability, sin_cos_mse, magnitude_mse, pos_mse, reg_loss])

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   epsilon=self.params.adam_epsilon)
# Minimizing a non-scalar tensor optimizes the sum of its elements
op_minimize = optimizer.minimize(loss, global_step=global_step)

In order to display the results in TensorBoard, I then actually do

loss_sum = tf.reduce_sum(loss)

to get a scalar summary.
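
The summary wiring itself isn't shown in the question; in the TF 1.x API it would be along these lines (tf.summary.scalar; very old versions used tf.scalar_summary instead):

# Hypothetical summary setup around the reduce_sum above.
tf.summary.scalar('total_loss', loss_sum)
summary_op = tf.summary.merge_all()  # evaluated periodically and written via tf.summary.FileWriter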

Adam is set to a learning rate of 1e-4 and an epsilon of 1e-4 (I see the same behavior with the default value for epsilon, and it breaks even faster when I keep the learning rate at 1e-3). Regularization also has no influence on this; it does this sort of consistently at some point.

I should also add that stopping the training and restarting from the last checkpoint - implying that the training input files are shuffled again as well - results in the same behavior. The training always seems to behave similarly at that point.

Recommended answer

Yes. This is a known problem of Adam.

The equations for Adam are:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where m is an exponential moving average of the mean gradient and v is an exponential moving average of the squares of the gradients. The problem is that when you have been training for a long time and are close to the optimum, v can become very small. If all of a sudden the gradients start increasing again, the update will be divided by a very small number and explode.

By default beta1=0.9 and beta2=0.999, so m changes much more quickly than v. As a result, m can become large again while v is still small and cannot catch up.
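
A tiny NumPy sketch of one Adam step makes this concrete. Starting from the tiny m and v left over after a long stretch near the optimum, a suddenly larger gradient pushes m up immediately while sqrt(v) lags behind, so the effective step jumps to several times the learning rate unless epsilon pads the denominator. All numbers are arbitrary and only illustrate the effect:

import numpy as np

beta1, beta2, lr = 0.9, 0.999, 1e-4

# State after a long stretch near the optimum: gradients were ~1e-6,
# so both moving averages are tiny.
m, v = 1e-6, 1e-12

# A suddenly much larger gradient arrives.
g = 1e-3
m = beta1 * m + (1 - beta1) * g      # m jumps to ~1e-4 right away
v = beta2 * v + (1 - beta2) * g * g  # v only reaches ~1e-9, so sqrt(v) ~3e-5

# The bias correction is ~1 after 200k steps, so lr_t is effectively lr.
for eps in (1e-8, 1e-3, 1e-2, 1e-1):
    step = lr * m / (np.sqrt(v) + eps)
    print('eps=%g  effective step=%.2e' % (eps, step))

# With the default eps=1e-8 the step is ~3x the learning rate, and it stays
# elevated for many iterations while v catches up; a larger eps dominates
# the denominator and damps the step back down.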

To remedy this problem you can increase epsilon, which is 1e-8 by default, so that you stop dividing by a value that is almost 0. Depending on your network, an epsilon value of 0.1, 0.01, or 0.001 might be good.
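
In TensorFlow that is just the constructor argument the question already uses; the concrete value below is an arbitrary pick from the range above:

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-2)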
