Why is the gradient of categorical crossentropy loss with respect to logits 0 with GradientTape in TF 2.0?

Problem description

I am learning TensorFlow 2.0 and I am trying to figure out how Gradient Tapes work. I have this simple example in which I evaluate the cross entropy loss between logits and labels. I am wondering why the gradients with respect to the logits are zero. (Please look at the code below.) The version of TF is tensorflow-gpu==2.0.0-rc0.

import tensorflow as tf

logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.losses.categorical_crossentropy(labels, logits))

grads = tape.gradient(loss, logits)
print(grads)

I get

 tf.Tensor(
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]], shape=(3, 3), dtype=float32)

as a result, but shouldn't it tell me how much I should change the logits in order to minimize the loss?

Answer

When calculating the cross entropy loss, set from_logits=True in tf.losses.categorical_crossentropy(). By default it is False, which means the inputs are treated as probabilities and the loss is computed directly as -p*log(q). With from_logits=True, the loss is computed as -p*log(softmax(q)).
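As a minimal sketch (reusing the logits and labels from the question; this snippet is not part of the original answer), passing from_logits=True makes the gradients non-zero:

import tensorflow as tf

logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=tf.float32)

with tf.GradientTape() as tape:
    # from_logits=True applies a softmax to the logits before the cross entropy
    loss = tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=True))

grads = tape.gradient(loss, logits)
print(grads)  # each row is softmax(logits) - labels, no longer all zeros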

Update:

I just found an interesting result.

logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    # from_logits=False, so the inputs are treated as probabilities (and normalized internally)
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))

grads = tape.gradient(loss, logits)
print(grads)

The grads are tf.Tensor([[-0.25 1. 1. ]], shape=(1, 3), dtype=float32).

Previously, I thought TensorFlow would use loss = -\Sigma_i p_i \log(q_i) to compute the loss, and if we differentiate with respect to q_i the derivative would be -p_i/q_i. So the expected grads should be [-1.25, 0, 0]. But the output grads look as if they were all increased by 1. This does not affect the optimization process, though.

For now, I'm still trying to figure out why the grads are increased by one. After reading the source code of tf.keras.losses.categorical_crossentropy, I found that even though we set from_logits=False, it still normalizes the probabilities, which changes the final gradient expression. Specifically, the gradient becomes -p_i/q_i + p_i/sum_j(q_j). If p_i=1 and sum_j(q_j)=1, the final gradient is increased by one. That's why the gradient is -0.25; however, I haven't figured out why the last two gradients are 1.
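To see the effect of that normalization, here is a hand-rolled sketch (my own illustration, not the library's exact implementation, which also clips the probabilities): computing -sum_i p_i*log(q_i/sum_j(q_j)) under a GradientTape reproduces the same gradients as above.

import tensorflow as tf

q = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)  # the "logits", treated as probabilities
p = tf.constant([[1.0, 0.0, 0.0]], dtype=tf.float32)  # the labels

with tf.GradientTape() as tape:
    q_norm = q / tf.reduce_sum(q, axis=-1, keepdims=True)  # the normalization step
    manual_loss = -tf.reduce_sum(p * tf.math.log(q_norm))

print(tape.gradient(manual_loss, q))  # [[-0.25  1.  1. ]], same as the tape output above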

To prove that all gradients are increased by 1/sum_j(q_j):

logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))

grads = tape.gradient(loss, logits)
print(grads)

The grads are tf.Tensor([[-0.57142866 1.4285713 1.4285713 ]]), whereas -p_i/q_i alone would give [-2, 0, 0].

It shows that all gradients are increased by 1/(0.5+0.1+0.1). For p_i==1, the gradient being increased by 1/(0.5+0.1+0.1) makes sense to me. But I don't understand why, for p_i==0, the gradient is still increased by 1/(0.5+0.1+0.1).
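As a quick numerical check (my own sketch, not part of the original answer), the tape gradients above match -p_i/q_i + 1/sum_j(q_j) element-wise:

import numpy as np

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.5, 0.1, 0.1])

manual_grads = -p / q + 1.0 / q.sum()  # -p_i/q_i plus the same 1/sum_j(q_j) shift for every i
print(manual_grads)  # [-0.5714...  1.4285...  1.4285...], same as the tape output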
