Should we do learning rate decay for the Adam optimizer?


Problem description

I'm training a network for image localization with the Adam optimizer, and someone suggested I use exponential decay. I didn't want to try it, because the Adam optimizer itself decays the learning rate, but he insists that he has done this before. So should I do it, and is there any theory behind the suggestion?

Recommended answer

It depends. Adam updates each parameter with an individual learning rate, which means that every parameter in the network has its own associated learning rate.

But each of these per-parameter learning rates is computed using lambda (the initial learning rate) as an upper limit. This means that every individual learning rate can vary from 0 (no update) to lambda (the maximum update).
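For intuition, here is a minimal NumPy sketch of a single Adam update step (following the update rule from the original Adam paper; the helper name and hyperparameter defaults are just illustrative):

    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and its element-wise square.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        # Bias-corrected estimates.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter step: the ratio m_hat / (sqrt(v_hat) + eps) is roughly
        # bounded by 1 in magnitude, so each parameter moves by at most about lr.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

The per-parameter ratio adapts on its own, but lr (lambda) always scales the whole step, which is why it acts as an upper limit.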

It is true that the learning rates adapt themselves during training, but if you want to be sure that no update step exceeds lambda, you can lower lambda itself using exponential decay or any other schedule. This can help to reduce the loss during the last steps of training, when the loss computed with the previously used lambda has stopped decreasing.
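As a concrete sketch of that suggestion, assuming a TensorFlow/Keras setup (the question does not name a framework, so treat the API choice as an assumption), an exponential-decay schedule can be passed directly to Adam as its learning rate:

    import tensorflow as tf

    # Decay the base learning rate (lambda) by 4% every 10,000 steps;
    # the numbers here are placeholders, not recommendations.
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3,
        decay_steps=10_000,
        decay_rate=0.96,
        staircase=True)

    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

Adam's per-parameter adaptation still happens as usual; the schedule only shrinks the upper bound on each step over time.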

