tensorflow giving nans when calculating gradient with sparse tensors


Problem description


The following snippet is from a fairly large piece of code but hopefully I can give all the information necessary:

y2 = tf.matmul(y1,ymask)

dist = tf.norm(ystar-y2,axis=0)

y1 and y2 are 128x30 and ymask is 30x30. ystar is 128x30. dist is 1x30. When ymask is the identity matrix, everything works fine. But when I set it to be all zeros, apart from a single 1 along the diagonal (so as to set all columns but one in y2 to be zero), I get nans for the gradient of dist with respect to y2, using tf.gradients(dist, [y2]). The specific value of dist is [0,0,7.9,0,...], with all the ystar-y2 values being around the range (-1,1) in the third column and zero elsewhere.
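For reference, here is a minimal numpy sketch of the masking step described above (shapes shrunk purely for illustration; the 4x3 sizes are not from the real code). Every column of ystar - y2 outside the selected task ends up exactly zero:

import numpy as np

y1 = np.random.randn(4, 3)                 # stand-in for the 128x30 activations
ymask = np.zeros((3, 3))
ymask[1, 1] = 1                            # keep only column 1
y2 = np.matmul(y1, ymask)                  # every other column of y2 is exactly zero
ystar = np.zeros((4, 3))
ystar[:, 1] = np.random.randn(4)           # targets are also zero outside column 1
print(np.linalg.norm(ystar - y2, axis=0))  # zero norm in every masked-out column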

I'm pretty confused as to why a numerical issue would occur here, given there are no logs or divisions. Is this underflow? Am I missing something in the maths?

For context, I'm doing this to try to train individual dimensions of y, one at a time, using the whole network.

Longer version to reproduce:

import tensorflow as tf
import numpy as np
import pandas as pd

batchSize = 128
eta = 0.8
tasks = 30
imageSize = 32**2
groups = 3
tasksPerGroup = 10
trainDatapoints = 10000

w = np.zeros([imageSize, groups * tasksPerGroup])
toyIndex = 0
for toyLoop in range(groups):
    m = np.ones([imageSize]) * np.random.randn(imageSize)
    for taskLoop in range(tasksPerGroup):
        w[:, toyIndex] = m * 0.1 * np.random.randn(1)
        toyIndex += 1

xRand = np.random.normal(0, 0.5, (trainDatapoints, imageSize))
taskLabels = np.matmul(xRand, w) + np.random.normal(0,0.5,(trainDatapoints, groups * tasksPerGroup))
DF = np.concatenate((xRand, taskLabels), axis=1)
trainDF = pd.DataFrame(DF[:trainDatapoints, ])

# define graph variables
x = tf.placeholder(tf.float32, [None, imageSize])
W = tf.Variable(tf.zeros([imageSize, tasks]))
b = tf.Variable(tf.zeros([tasks]))
ystar = tf.placeholder(tf.float32, [None, tasks])
ymask = tf.placeholder(tf.float32, [tasks, tasks])
dataLength = tf.cast(tf.shape(ystar)[0],dtype=tf.float32)

y1 = tf.matmul(x, W) + b
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])

trainStep = tf.train.GradientDescentOptimizer(eta).minimize(mse)

# build graph
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

randTask = np.random.randint(0, 9)
ymaskIn = np.zeros([tasks, tasks])
ymaskIn[randTask, randTask] = 1
batch = trainDF.sample(batchSize)
batch_xs = batch.iloc[:, :imageSize]
batch_ys = np.zeros([batchSize, tasks])
batch_ys[:, randTask] = batch.iloc[:, imageSize + randTask]

gradOut = sess.run(grads, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})

sess.run(trainStep, feed_dict={x: batch_xs, ystar: batch_ys, ymask:ymaskIn})
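If you run the snippet above, a quick check (added here, not part of the original snippet) shows where the NaNs land; in this setup they appear only in the columns zeroed out by ymask:

print(np.isnan(gradOut[0]).any(axis=0))          # True for the masked-out columns
print(np.isnan(gradOut[0][:, randTask]).any())   # the selected column stays finite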

Solution

Here's a very simple reproduction:

import tensorflow as tf

with tf.Graph().as_default():
  y = tf.zeros(shape=[1], dtype=tf.float32)
  dist = tf.norm(y,axis=0)
  (grad,) = tf.gradients(dist, [y])
  with tf.Session():
    print(grad.eval())

Prints:

[ nan]

The issue is that tf.norm computes sum(x**2)**0.5. The gradient is x / sum(x**2) ** 0.5 (see e.g. https://math.stackexchange.com/a/84333), so when sum(x**2) is zero we're dividing by zero.
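To see the division by zero directly, here is a tiny numpy check (mine, not from tf.norm's implementation) of that analytic gradient at x = 0:

import numpy as np

x = np.zeros(1, dtype=np.float32)
with np.errstate(invalid='ignore'):
    analytic_grad = x / np.sqrt(np.sum(x ** 2))   # 0 / 0
print(analytic_grad)   # [nan]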

There's not much to be done in terms of a special case: the gradient as x approaches all zeros depends on which direction it's approaching from. For example if x is a single-element vector, the limit as x approaches 0 could either be 1 or -1 depending on which side of zero it's approaching from.
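To make the direction-dependence concrete, here is a small sketch (added for illustration) evaluating the gradient of tf.norm just above and just below zero for a single-element vector:

import tensorflow as tf

with tf.Graph().as_default():
  y_pos = tf.constant([1e-3])
  y_neg = tf.constant([-1e-3])
  (grad_pos,) = tf.gradients(tf.norm(y_pos, axis=0), [y_pos])
  (grad_neg,) = tf.gradients(tf.norm(y_neg, axis=0), [y_neg])
  with tf.Session() as sess:
    print(sess.run([grad_pos, grad_neg]))   # approximately [1.] and [-1.]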

So in terms of solutions, you could just add a small epsilon:

import tensorflow as tf

def safe_norm(x, epsilon=1e-12, axis=None):
  return tf.sqrt(tf.reduce_sum(x ** 2, axis=axis) + epsilon)

with tf.Graph().as_default():
  y = tf.constant([0.])
  dist = safe_norm(y,axis=0)
  (grad,) = tf.gradients(dist, [y])
  with tf.Session():
    print(grad.eval())

Prints:

[ 0.]

Note that this is not actually the Euclidean norm. It's a good approximation as long as the input is much larger than epsilon.
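Applied to the code in the question, that would just mean swapping the tf.norm call for safe_norm, roughly like this (untested sketch using the question's variable names):

dist = safe_norm(ystar - y2, axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])   # finite even in the masked-out columns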
