Why is the gradient of categorical crossentropy loss with respect to logits 0 with gradient tape in TF2.0?
Question
I am learning TensorFlow 2.0 and I am trying to figure out how gradient tapes work. I have this simple example, in which I evaluate the cross entropy loss between logits and labels. Why are the gradients with respect to the logits all zero? (Please look at the code below.) The TF version is tensorflow-gpu==2.0.0-rc0.
logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.losses.categorical_crossentropy(labels, logits))
grads = tape.gradient(loss, logits)
print(grads)
I get

tf.Tensor(
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]], shape=(3, 3), dtype=float32)
as a result. But shouldn't it tell me how much I should change the logits in order to minimize the loss?
Answer
When calculating the cross entropy loss, set from_logits=True in tf.losses.categorical_crossentropy(). By default it is False, which means the cross entropy loss is computed directly as -p*log(q). With from_logits=True, the loss is computed as -p*log(softmax(q)).
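The effect of the flag can be sketched without TF at all: with from_logits=True, the gradient of softmax cross entropy with respect to the logits is the standard softmax(q) - p, which is nonzero for the question's inputs. A minimal pure-Python check (hand-written softmax, a sketch rather than TF's actual implementation):

```python
import math

def softmax(q):
    m = max(q)                       # subtract max for numerical stability
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [v / s for v in e]

def grad_wrt_logits(p, q):
    # Gradient of -sum_i p_i * log(softmax(q)_i) w.r.t. q: softmax(q) - p
    return [sm - pi for sm, pi in zip(softmax(q), p)]

# One row from the question: logits [1, 0, 0], one-hot label [1, 0, 0]
g = grad_wrt_logits([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print(g)  # roughly [-0.4239, 0.2119, 0.2119] -- nonzero, so training can proceed
```

Note the components sum to zero, as they must for a softmax cross entropy gradient.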
Update:

Just found an interesting result.
logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)
The grads are tf.Tensor([[-0.25 1. 1. ]], shape=(1, 3), dtype=float32).
Previously, I thought TensorFlow would use loss = -\Sigma_i p_i \log(q_i) to calculate the loss, and if we differentiate with respect to q_i, the derivative would be -p_i/q_i. So the expected grads should be [-1.25, 0, 0]. But the output grads all look increased by 1. Still, this won't affect the optimization process.
Why are the grads increased by one? After reading the source code of tf.keras.losses.categorical_crossentropy, I found that even with from_logits=False it still normalizes the probabilities, i.e. it uses q_i/\Sigma_j q_j instead of q_i. That changes the final gradient expression. Specifically, the gradient becomes -p_i/q_i + (\Sigma_j p_j)/(\Sigma_j q_j), and since \Sigma_j p_j = 1 for a one-hot label, every component gets an extra 1/\Sigma_j q_j. Here \Sigma_j q_j = 1, so the first gradient is -1/0.8 + 1 = -0.25, and the components with p_i = 0 are 0 + 1 = 1.
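A quick numeric sanity check of this reading, hand-computing the gradient of the normalized loss rather than calling TF (the helper below is my own sketch, not part of TF):

```python
def normalized_ce_grad(p, q):
    # Gradient of L = -sum_i p_i * log(q_i / sum_j q_j) w.r.t. q_i:
    #   dL/dq_i = -p_i/q_i + (sum_j p_j)/(sum_j q_j)
    s = sum(q)
    return [-pi / qi + sum(p) / s for pi, qi in zip(p, q)]

g = normalized_ce_grad([1.0, 0.0, 0.0], [0.8, 0.1, 0.1])
print(g)  # [-0.25, 1.0, 1.0], matching the tape.gradient output above
```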
To confirm that all gradients are increased by 1/\Sigma_j q_j:
logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)
The grads are tf.Tensor([[-0.57142866 1.4285713 1.4285713 ]], whereas the unnormalized formula -p_i/q_i would give [-2, 0, 0].
This shows that all gradients are increased by 1/(0.5+0.1+0.1). The component with p_i = 1 gets -1/0.5 + 1/0.7 ≈ -0.5714, and the components with p_i = 0 get 0 + 1/0.7 ≈ 1.4286. The reason the 1/\Sigma_j q_j term shows up even where p_i = 0 is that the normalization adds a \log(\Sigma_j q_j) term to the loss (scaled by \Sigma_j p_j = 1), and that term depends on every q_i.
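One way to confirm that the p_i == 0 components really do pick up 1/\Sigma_j q_j from the shared log-sum term is a finite-difference check on the normalized loss (again replicated by hand, not calling TF):

```python
import math

def normalized_ce_loss(p, q):
    # L = -sum_i p_i * log(q_i / s); the log(s) part touches every q_i
    s = sum(q)
    return -sum(pi * math.log(qi / s) for pi, qi in zip(p, q) if pi != 0.0)

def fd_grad(p, q, eps=1e-6):
    # Central finite differences: (L(q_i + eps) - L(q_i - eps)) / (2 * eps)
    g = []
    for i in range(len(q)):
        q_hi = list(q); q_hi[i] += eps
        q_lo = list(q); q_lo[i] -= eps
        g.append((normalized_ce_loss(p, q_hi) - normalized_ce_loss(p, q_lo)) / (2 * eps))
    return g

g = fd_grad([1.0, 0.0, 0.0], [0.5, 0.1, 0.1])
print(g)
# close to [-0.5714, 1.4286, 1.4286]: even where p_i == 0, the
# gradient is 1 / (0.5 + 0.1 + 0.1), coming from the log-sum term
```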