How does one debug NaN values in TensorFlow?

Question

I am running TensorFlow and something in my model is producing a NaN. I'd like to find out what it is, but I don't know how. The main issue is that in a "normal" procedural program I would just write a print statement right before the operation is executed. With TensorFlow I cannot do that, because I first declare (or define) the graph, so adding print statements to the graph definition does not help. Are there any rules, advice, or heuristics for tracking down what might be causing the NaN?

In this case I know more precisely which line to look at, because I have the following:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it is a pairwise Euclidean distance
Z = tf.sqrt(Delta_tilde)
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

When this line is present, the result is NaN, as reported by my summary writers. Why is this? Is there a way to at least inspect the value of Z after the square root is taken?

For the specific example I posted, I tried tf.Print(0,Z), but without success: it printed nothing. As in:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it is a pairwise Euclidean distance
Z = tf.sqrt(Delta_tilde)
tf.Print(0,[Z]) # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

I actually don't understand what tf.Print is supposed to do. Why does it need two arguments? If I want to print one tensor, why would I need to pass two? It seems bizarre to me.

I was looking at the function tf.add_check_numerics_ops(), but its documentation doesn't say how to use it (and the docs don't seem very helpful). Does anyone know how to use it?

Since some comments suggested the data might be bad: I am using standard MNIST. However, I am computing a quantity that is positive (a pairwise Euclidean distance) and then taking its square root, so I don't see how the data specifically would be the issue.

Recommended answer

There are several reasons why you can get a NaN result. Often the learning rate is too high, but plenty of other causes are possible, for example corrupt data in your input queue or computing the log of 0.
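As a rough illustration of how such values arise, the NumPy sketch below reproduces two typical cases: a log of 0 feeding into a product, and a square root of a negative number. The second case reuses the expression from the question, under the assumption (not shown in the question) that XX and WW hold the squared norms of the rows of x and the columns of W. All shapes and data here are hypothetical stand-ins. Under that assumption the expression equals the negated squared distance, so it is never positive and its square root is NaN.

```python
import numpy as np

# Silence NumPy's floating-point warnings so the invalid results are easy to see.
np.seterr(divide="ignore", invalid="ignore")

# Case 1: log(0) is -inf, and 0 * -inf (as in p * log(p)) is NaN.
p = np.array([0.0, 0.5])
entropy_terms = p * np.log(p)
print(entropy_terms)                       # the first entry is nan

# Case 2: hypothetical stand-ins for the tensors in the question.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                # 4 data points of dimension 3 (assumed)
W = rng.normal(size=(3, 2))                # 2 centres stored as columns (assumed)
XX = np.sum(x**2, axis=1, keepdims=True)   # squared row norms, shape (4, 1) (assumed)
WW = np.sum(W**2, axis=0, keepdims=True)   # squared column norms, shape (1, 2) (assumed)

# Same expression as in the question. Algebraically this is -||x_i - w_j||^2,
# so every entry is <= 0 and the square root is NaN wherever it is negative.
delta_tilde = 2.0 * x @ W - (WW + XX)
print(delta_tilde.max())                   # largest entry is still negative here
print(np.sqrt(delta_tilde))                # every entry is nan
```

If the real XX and WW are defined this way, flipping the sign (tf.add(WW, XX) - 2.0*tf.matmul(x,W)) would give the non-negative squared distance, though tiny negative values from rounding could still need clipping before the square root.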

In any case, the print debugging you describe cannot be done with a plain Python print, as that would only print the tensor's information inside the graph, not any actual values.
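A minimal sketch of that difference, assuming a TensorFlow installation that provides tf.compat.v1 (graph-and-session style); the input values are made up for illustration:

```python
import tensorflow as tf

# Build an explicit TF 1.x-style graph; nothing is computed at this point.
graph = tf.Graph()
with graph.as_default():
    delta = tf.constant([-1.0, 0.0, 4.0])   # hypothetical values, one of them negative
    Z = tf.sqrt(delta)

# A plain Python print only shows the symbolic tensor (name, shape, dtype).
print(Z)

# Actual values exist only once the graph is executed in a session.
with tf.compat.v1.Session(graph=graph) as sess:
    z_values = sess.run(Z)
print(z_values)   # the first entry is nan (square root of -1)
```

Fetching intermediate tensors with sess.run in this way is often the quickest check when the suspicious tensor is easy to reach from Python.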

However, if you use tf.Print as an op when building the graph, then the actual values are printed when the graph is executed (and it is a good exercise to watch these values to debug and understand the behavior of your net).

However, you are not using the print op quite correctly. It is an op, so you need to pass it a tensor and use the tensor it returns in the rest of the graph that gets executed. Otherwise the op is never executed and nothing is printed. Try this:

Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z,[Z], message="my Z-values:") # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently it returns Z unchanged (the identity) for debugging
Z = tf.pow(Z, 2.0)
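For a version that can be run end to end, here is a hedged sketch assuming a TensorFlow installation where the old op is available as tf.compat.v1.Print. The first argument of tf.Print is the tensor that is passed through unchanged, and the second is the list of tensors to print, which is why it takes two arguments. The sketch also adds tf.debugging.check_numerics, the per-tensor check that appears to be related to the tf.add_check_numerics_ops() function mentioned in the question. The input values are hypothetical and Transform is left out.

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    delta_tilde = tf.constant([4.0, -1.0])   # hypothetical input with a negative entry
    Z = tf.sqrt(delta_tilde)
    # tf.Print is an identity op with a printing side effect: keep its return
    # value and build the rest of the graph on it, otherwise it never runs.
    Z_printed = tf.compat.v1.Print(Z, [Z], message="my Z-values:")
    A = tf.exp(tf.pow(Z_printed, 2.0))
    # check_numerics is also an identity op; running it fails on NaN or Inf.
    Z_checked = tf.debugging.check_numerics(Z, message="Z after sqrt")

with tf.compat.v1.Session(graph=graph) as sess:
    a_values = sess.run(A)                   # prints Z to stderr as a side effect
    try:
        sess.run(Z_checked)
        nan_detected = False
    except tf.errors.InvalidArgumentError as err:
        nan_detected = True
        print("check_numerics failed:", type(err).__name__)

print(a_values)
```

The printed values go to standard error, so in a notebook they may show up in the server log rather than in the cell output.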
