tf.GradientTape returns None for gradient


Problem Description


I am using tf.GradientTape().gradient() to compute a representer point, which can be used to compute the "influence" of a given training example on a given test example. A representer point for a given test example x_t and training example x_i is computed as the dot product of their feature representations, f_t and f_i, multiplied by a weight alpha_i.

Note: The details of this approach are not necessary for understanding the question, since the main issue is getting gradient tape to work. That being said, I have included a screenshot of some of the details below for anyone who is interested.

Computing alpha_i requires differentiation, since it is expressed as the following:
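Roughly, up to a constant scaling factor from the representer point formulation:

\[
\alpha_{ij} \;\propto\; \frac{\partial L(x_i, y_i)}{\partial \phi_j(x_i)}
\]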

In the equation above, L is the standard loss function (categorical cross-entropy for multiclass classification) and phi is the pre-softmax activation output (so its length is the number of classes). Furthermore, alpha_i can be broken up into alpha_ij, which is computed with respect to a specific class j. Therefore, we just take the pre-softmax output phi_j corresponding to the predicted class of the test example (the class with the highest final prediction).

I have created a simple setup with MNIST and have implemented the following:

import tensorflow as tf
from tensorflow.keras import Input, layers

num_classes = 10  # MNIST has 10 classes

def simple_mnist_cnn(input_shape=(28, 28, 1)):
  input = Input(shape=input_shape)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(input)
  x = layers.MaxPooling2D(pool_size=(2, 2))(x)
  x = layers.Conv2D(64, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(2, 2))(x)
  x = layers.Flatten()(x) # feature representation
  output = layers.Dense(num_classes, activation=None)(x) # pre-softmax activation output
  activation = layers.Activation(activation='softmax')(output) # final output with softmax activation
  model = tf.keras.Model(input, [x, output, activation], name="mnist_model")
  return model

Now assume the model is trained, and I want to compute the influence of a given train example on a given test example's prediction, perhaps for model understanding/debugging purposes.

with tf.GradientTape() as t1:
  f_t, _, pred_t = model(x_t) # get features for misclassified example
  f_i, presoftmax_i, pred_i = model(x_i)

  # compute dot product of feature representations for x_t and x_i
  dotps = tf.reduce_sum(
            tf.multiply(f_t, f_i))

  # get presoftmax output corresponding to highest predicted class of x_t
  phi_ij = presoftmax_i[:,np.argmax(pred_t)]

  # y_i is actual label for x_i
  cl_loss_i = tf.keras.losses.categorical_crossentropy(y_i, pred_i)

alpha_ij = t1.gradient(cl_loss_i, phi_ij)
# note: alpha_ij returns None currently
k_ij = tf.reduce_sum(tf.multiply(alpha_ij, dotps))

The code above gives the following error, since alpha_ij is None: ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor. However, if I change t1.gradient(cl_loss_i, phi_ij) to t1.gradient(cl_loss_i, presoftmax_i), it no longer returns None. I'm not sure why this is the case. Is there an issue with computing gradients on sliced tensors? Is there an issue with "watching" too many variables? I haven't worked much with gradient tape, so I'm not sure what the fix is, but I would appreciate help.

For anyone who is interested, more details are in the screenshot mentioned above (not reproduced here).

Solution

I don't see you watch any tensors anywhere. Note that the tape only traces tf.Variable by default. Is this missing from your code? Otherwise I don't see how t1.gradient(cl_loss_i, presoftmax_i) is working.
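In case that is the missing piece, here is a minimal sketch of explicit watching, reusing your variable names (whether it is actually needed depends on what the tape already traces):

with tf.GradientTape() as t1:
  f_i, presoftmax_i, pred_i = model(x_i)
  t1.watch(presoftmax_i)  # explicitly tell the tape to track this intermediate tensor
  cl_loss_i = tf.keras.losses.categorical_crossentropy(y_i, pred_i)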

Either way, I think the easiest way to fix it is to do

all_gradients = t1.gradient(cl_loss_i, presoftmax_i)
desired_gradients = all_gradients[:, np.argmax(pred_t)]

so simply do the indexing after the gradient. Note that this can be wasteful (if there are many classes) as you are computing more gradients than you need.
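For completeness, a rough sketch of how this plugs back into your k_ij computation, reusing your variable names:

alpha_ij = desired_gradients                        # gradient of the loss w.r.t. phi_j only
k_ij = tf.reduce_sum(tf.multiply(alpha_ij, dotps))  # representer value for (x_t, x_i)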

The explanation for why (I believe) your version doesn't work would be easiest to show in a drawing, but let me try to explain: imagine the computations as a directed graph. We have

presoftmax_i -> pred_i -> cl_loss_i

Backpropagating the loss to the presoftmax is easy. But then you set up another branch,

presoftmax_i -> presoftmax_ij

Now, when you try to compute the gradient of the loss with respect to presoftmax_ij, there is actually no backpropagation path (we can only follow arrows backwards). Another way to think about it: You compute presoftmax_ij after computing the loss. How could the loss depend on it then?
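Here is a tiny standalone toy example (my own, not from your code) that reproduces the same behavior:

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)                 # x is a plain tensor, so watch it explicitly
  loss = tf.reduce_sum(x ** 2)  # the loss is computed from x ...
  x_slice = x[:, 1]             # ... and only afterwards is the slice created

print(tape.gradient(loss, x))        # tf.Tensor([[2. 4. 6.]], ...)
print(tape.gradient(loss, x_slice))  # None: no backprop path from the loss to x_slice
del tape                             # release the resources held by the persistent tape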
