Loss clipping in TensorFlow (on DeepMind's DQN)

Problem Description

I am trying my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with clipping of the loss function.

Here is an excerpt from the Nature paper describing the loss clipping:

We also found it helpful to clip the error term from the update to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.

(Link to the full paper: http://www.Nature.com/nature/journal/v518/n7540/full/nature14236.html)

What I have tried so far is using

clipped_loss_vec = tf.clip_by_value(loss, -1, 1)

to clip the loss I calculate between -1 and +1. The agent is not learning the proper policy in this case. I printed out the gradients of the network and realized that if the loss falls below -1, the gradients all suddenly turn to 0!

My reasoning for this happening is that the clipped loss is a constant function in (-inf,-1) U (1,inf), which means it has zero gradient in those regions. This in turn ensures that the gradients throughout the network are zero (think of it as, whatever input image I provide the network, the loss stays at -1 in the local neighborhood because it has been clipped).
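
A tiny sketch that reproduces this (TF1-era graph-mode API, a toy scalar variable with made-up numbers): once the raw loss sits outside [-1, 1], the clipped loss is locally constant, so the gradient with respect to the parameters comes out exactly zero:

import tensorflow as tf

# Toy example: the raw loss is well outside [-1, 1], so the clipped loss
# is locally constant and contributes no gradient.
w = tf.Variable(3.0)
raw_loss = tf.square(w)                            # 9.0, outside [-1, 1]
clipped_loss = tf.clip_by_value(raw_loss, -1.0, 1.0)

grad_raw, = tf.gradients(raw_loss, [w])            # 2 * w = 6.0
grad_clipped, = tf.gradients(clipped_loss, [w])    # 0.0, no learning signal

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_raw, grad_clipped]))      # [6.0, 0.0]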

So, my question has two parts:

  1. What exactly did DeepMind mean in the excerpt? Did they mean that the loss below -1 is clipped to -1 and above +1 is clipped to +1? If so, how did they deal with the gradients (i.e. what is all that part about absolute value functions)?

  2. How should I implement loss clipping in TensorFlow such that the gradients do not go to zero outside the clipped range (but maybe stay at +1 and -1)? Thanks!

Answer

I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
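
In TensorFlow this can be done with compute_gradients / apply_gradients. A minimal sketch, assuming the TF1-era graph-mode API, where `loss` is the squared TD-error loss you already build elsewhere and the optimizer and learning rate are just placeholder choices:

import tensorflow as tf

# `loss` is assumed to be defined elsewhere; RMSProp and the learning rate
# below are placeholder choices, not necessarily what the paper used.
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.00025)

grads_and_vars = optimizer.compute_gradients(loss)
# Clip each gradient component to [-1, 1] before applying the update.
clipped_gvs = [(tf.clip_by_value(g, -1.0, 1.0), v)
               for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped_gvs)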

Equivalently, define a function f as follows:

f(x) = x^2          if x in [-0.5,0.5]
f(x) = |x| - 0.25   if x < -0.5 or x > 0.5

Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is some kind of hybrid between squared loss and absolute-value loss: it will behave like s^2 when s is small, but when s gets larger, it will behave like the absolute value (|s|).
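
Written element-wise in TensorFlow, f might look like the following sketch (TF1-era ops; `error` is a placeholder name for the term called s above, e.g. the TD error):

import tensorflow as tf

def clipped_error(error):
    # x^2 inside [-0.5, 0.5], |x| - 0.25 outside: exactly the f defined above
    quadratic = tf.square(error)
    linear = tf.abs(error) - 0.25
    return tf.where(tf.less_equal(tf.abs(error), 0.5), quadratic, linear)

# e.g.  loss = tf.reduce_mean(clipped_error(target_q - predicted_q))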

Notice that f has the nice property that its derivative is always in the range [-1,1]:

f'(x) = 2x    if x in [-0.5,0.5]
f'(x) = +1    if x > +0.5
f'(x) = -1    if x < -0.5

Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared loss and then clipping it.

Thus, what they're doing is effectively replacing a squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
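
If you prefer a built-in, the same f can be obtained from tf.losses.huber_loss in the TF1 API: with delta = 0.5, twice the Huber loss gives 2 * (0.5 * x^2) = x^2 for |x| <= 0.5 and 2 * (0.5*|x| - 0.125) = |x| - 0.25 otherwise, which is exactly f. A sketch, where `target_q` and `predicted_q` are placeholder names for your own tensors:

import tensorflow as tf

# Per-element f, built from the library Huber loss (reduction disabled so the
# factor of two applies element-wise), then averaged over the batch.
per_element = 2.0 * tf.losses.huber_loss(
    labels=target_q, predictions=predicted_q,
    delta=0.5, reduction=tf.losses.Reduction.NONE)
loss = tf.reduce_mean(per_element)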

Now the point is that the following two alternatives are equivalent:

  • Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.

  • Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function directly (unchanged) in the gradient descent.

The former is easy to implement. The latter has nice properties (it improves stability; it's better than absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, this means we get an easy-to-implement scheme that has the simplicity of squared loss with the stability and robustness of the Huber loss.
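
Here is a quick numerical sanity check of the equivalence on a scalar error outside the quadratic region (TF1-era graph mode, toy value); both alternatives yield the same derivative with respect to the error term:

import tensorflow as tf

x = tf.Variable(1.7)                                       # |x| > 0.5, linear regime

squared = tf.square(x)
huber_like = tf.where(tf.less_equal(tf.abs(x), 0.5), tf.square(x), tf.abs(x) - 0.25)

grad_sq, = tf.gradients(squared, [x])
clipped_grad_sq = tf.clip_by_value(grad_sq, -1.0, 1.0)     # alternative 1
grad_huber, = tf.gradients(huber_like, [x])                # alternative 2

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([clipped_grad_sq, grad_huber]))         # [1.0, 1.0]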
