Loss clipping in TensorFlow (on DeepMind's DQN)
Question
I am trying my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with clipping of the loss function.
Here is an excerpt from the Nature paper describing the loss clipping:
We also found it helpful to clip the error term from the update to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.
(Link to full paper: http://www.Nature.com/nature/journal/v518/n7540/full/nature14236.html)
What I have tried so far is using
clipped_loss_vec = tf.clip_by_value(loss, -1, 1)
to clip the loss I calculate between -1 and +1. The agent is not learning the proper policy in this case. I printed out the gradients of the network and realized that if the loss falls below -1, the gradients all suddenly turn to 0!
My reasoning for this happening is that the clipped loss is a constant function in (-inf,-1) U (1,inf), which means it has zero gradient in those regions. This in turn ensures that the gradients throughout the network are zero (think of it as, whatever input image I provide the network, the loss stays at -1 in the local neighborhood because it has been clipped).
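This failure mode can be reproduced without TensorFlow at all. The following is a minimal pure-Python sketch (the helper names are mine, purely for illustration) that uses a finite-difference estimate to show the clipped loss has zero slope once the raw loss leaves [-1, 1]:

```python
# Why clipping the *loss* kills learning: outside [-1, 1] the clipped loss
# is a constant function, so its derivative is exactly 0.

def clipped_loss(x):
    """What tf.clip_by_value(loss, -1, 1) computes for a raw loss value x."""
    return max(-1.0, min(1.0, x))

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Inside the clip range the gradient passes through unchanged...
print(numerical_grad(clipped_loss, 0.3))   # ~1.0
# ...but once the raw loss falls below -1 (or rises above +1), it is 0.
print(numerical_grad(clipped_loss, -2.5))  # 0.0
print(numerical_grad(clipped_loss, 4.0))   # 0.0
```

This is exactly the zero-gradient behavior observed in the network's printouts above.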
So, my question has two parts:
What exactly did DeepMind mean in the excerpt? Did they mean that the loss below -1 is clipped to -1 and the loss above +1 is clipped to +1? If so, how did they deal with the gradients (i.e., what is all that about absolute value functions)?
How should I implement loss clipping in TensorFlow such that the gradients do not go to zero outside the clipped range (but maybe stay at +1 and -1)? Thanks!
Answer
I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
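As a dependency-free sketch of that recipe (in TensorFlow 1.x this would be the `compute_gradients` → `tf.clip_by_value` → `apply_gradients` pattern; the helper names below are made up for illustration, not library functions):

```python
# Compute the squared-loss gradient as usual, clip each component of the
# gradient to [-1, 1], then take the gradient descent step with the result.

def clip(g, lo=-1.0, hi=1.0):
    """Clip a single gradient component into [lo, hi]."""
    return max(lo, min(hi, g))

def sgd_step(params, grads, lr=0.1):
    clipped = [clip(g) for g in grads]           # per-component clipping
    return [p - lr * g for p, g in zip(params, clipped)]

params = [0.0, 0.0]
grads = [5.0, -0.3]   # one large component, one small
# The large component is clipped to +1 before the update; the small one
# passes through unchanged.
print(sgd_step(params, grads))
```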
Equivalently: define a function f as follows:
f(x) = x^2 if x in [-0.5,0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest to use f(s) as the loss function. This is some kind of hybrid between squared loss and absolute-value loss: it will behave like s^2 when s is small, but when s gets larger, it will behave like the absolute value (|s|).
Notice that f has the nice property that its derivative is always in the range [-1,1]:
f'(x) = 2x if x in [-0.5,0.5]
f'(x) = +1 if x > 0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of the squared loss and then clipping it.
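This can be checked pointwise with a few lines of Python (illustrative names only): the clipped squared-loss gradient and the derivative of f agree everywhere.

```python
# Check that clip(d/dx x^2, -1, 1) coincides with f'(x) from the piecewise
# definition above.

def clipped_sq_grad(x):
    return max(-1.0, min(1.0, 2 * x))   # d/dx x^2 = 2x, then clip to [-1, 1]

def f_grad(x):
    """Derivative of f: 2x on [-0.5, 0.5], +/-1 outside."""
    if -0.5 <= x <= 0.5:
        return 2 * x
    return 1.0 if x > 0.5 else -1.0

for x in [-4.0, -0.5, 0.0, 0.3, 0.5, 7.0]:
    assert clipped_sq_grad(x) == f_grad(x)
print("clipped squared-loss gradient matches the gradient of f")
```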
Thus, what they're doing is effectively replacing the squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
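That relationship is easy to verify numerically. The sketch below (helper names are mine, not from any library) compares f against the standard Huber loss definition:

```python
# Verify numerically that f(x) == 2 * huber(x, delta=0.5) at several points.

def f(x):
    """The piecewise loss defined above."""
    return x * x if -0.5 <= x <= 0.5 else abs(x) - 0.25

def huber(x, delta):
    """Standard Huber loss: quadratic near 0, linear beyond |x| = delta."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)

for x in [-3.0, -0.5, -0.1, 0.0, 0.2, 0.5, 2.0]:
    assert abs(f(x) - 2 * huber(x, 0.5)) < 1e-12
print("f(x) == 2 * huber(x, 0.5) at all test points")
```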
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it directly (unchanged) in the gradient descent.
The former is easy to implement. The latter has nice properties (improves stability; it's better than absolute-value-loss because it avoids oscillating around the minimum). Because the two are equivalent, this means we get an easy-to-implement scheme that has the simplicity of squared-loss with the stability and robustness of the Huber loss.