Loss clipping in TensorFlow (on DeepMind's DQN)

Question

I am trying my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with clipping of the loss function.

Here is an excerpt from the Nature paper describing the loss clipping:

We also found it helpful to clip the error term from the update to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.

(Link to the full paper: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)

What I have tried so far is using

clipped_loss_vec = tf.clip_by_value(loss, -1, 1)

to clip the loss I calculate between -1 and +1. The agent is not learning the proper policy in this case. I printed out the gradients of the network and realized that if the loss falls below -1, the gradients all suddenly turn to 0!

My reasoning for this happening is that the clipped loss is a constant function in (-inf,-1) U (1,inf), which means it has zero gradient in those regions. This in turn ensures that the gradients throughout the network are zero (think of it as, whatever input image I provide the network, the loss stays at -1 in the local neighborhood because it has been clipped).
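
A minimal sketch that reproduces this effect (a hypothetical one-weight "network", TF 1.x graph API assumed; the names and values are purely illustrative):

import tensorflow as tf

# Toy setup: one trainable weight and a squared loss far outside [-1, 1].
w = tf.Variable(3.0)
loss = tf.square(w)                                  # 9.0
clipped_loss = tf.clip_by_value(loss, -1.0, 1.0)     # saturates at 1.0

grad_raw = tf.gradients(loss, [w])[0]                # 2*w = 6.0
grad_clipped = tf.gradients(clipped_loss, [w])[0]    # 0.0 -- the clip is flat here

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_raw, grad_clipped]))        # [6.0, 0.0]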

So, my question is in two parts:

  1. What exactly did DeepMind mean in the excerpt? Did they mean that the loss below -1 is clipped to -1 and above +1 is clipped to +1? If so, how did they deal with the gradients (i.e. what is all that part about absolute value functions)?

  2. How should I implement loss clipping in TensorFlow such that the gradients do not go to zero outside the clipped range (but maybe stay at +1 and -1)? Thanks!

Answer

I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
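
A sketch of this in TF 1.x style (the names loss and params are assumptions standing in for your unclipped loss tensor and the network's trainable variables, and the optimizer and learning rate are placeholders, not DeepMind's exact setup):

import tensorflow as tf

optimizer = tf.train.RMSPropOptimizer(learning_rate=0.00025)  # placeholder hyperparameter

# Compute the raw gradients, then clip each component to [-1, 1] before applying.
grads_and_vars = optimizer.compute_gradients(loss, var_list=params)
clipped_gvs = [(tf.clip_by_value(g, -1.0, 1.0), v)
               for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped_gvs)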

Equivalently: define the function f as follows:

f(x) = x^2          if x in [-0.5,0.5]
f(x) = |x| - 0.25   if x < -0.5 or x > 0.5

Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest to use f(s) as the loss function. This is some kind of hybrid between squared-loss and absolute-value-loss: it will behave like s^2 when s is small, but when s gets larger, it will behave like the absolute value (|s|).
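
One way to express f in TensorFlow is to branch on |s| with tf.where; a minimal sketch, where td_error is a hypothetical tensor holding your per-sample error s:

import tensorflow as tf

def clipped_error(s):
    # f(s): quadratic for |s| <= 0.5, linear (|s| - 0.25) outside
    return tf.where(tf.abs(s) <= 0.5,
                    tf.square(s),
                    tf.abs(s) - 0.25)

loss = tf.reduce_mean(clipped_error(td_error))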

Notice that the derivative of f has the nice property that it will always be in the range [-1,1]:

f'(x) = 2x    if x in [-0.5,0.5]
f'(x) = +1    if x > 0.5
f'(x) = -1    if x < -0.5

Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared-loss and then clipping it.
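
You can check this numerically in the scalar case, where the error itself is the variable being differentiated; a small TF 1.x sketch:

import tensorflow as tf

x = tf.Variable([-2.0, -0.3, 0.3, 2.0])
f = tf.where(tf.abs(x) <= 0.5, tf.square(x), tf.abs(x) - 0.25)

grad_f = tf.gradients(tf.reduce_sum(f), [x])[0]
grad_sq_clipped = tf.clip_by_value(
    tf.gradients(tf.reduce_sum(tf.square(x)), [x])[0], -1.0, 1.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_f, grad_sq_clipped]))   # both: [-1., -0.6, 0.6, 1.]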

Thus, what they're doing is effectively replacing a squared-loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.

Now the point is that the following two alternatives are equivalent:

  • Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.

  • Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function directly (unchanged) in the gradient descent.

The former is easy to implement. The latter has nice properties (improves stability; it's better than absolute-value-loss because it avoids oscillating around the minimum). Because the two are equivalent, this means we get an easy-to-implement scheme that has the simplicity of squared-loss with the stability and robustness of the Huber loss.
