Tensorflow NaN bug?


Problem description


I'm using TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new image set, until suddenly (while still converging, at around 92% accuracy) it crashes with the error that ReluGrad received non-finite values. Debugging shows nothing unusual in the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding

print "max W values: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b values: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())

as debug code to each loop, yields the following output:

Step 8600
max W values: 0.759422 0.295087 0.344725 0.583884
max b values: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W values: 0.75947 0.295084 0.344723 0.583893
max b values: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W values: 0.759521 0.295101 0.34472 0.5839
max b values: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values is very high, the only way a NaN can arise is from a badly handled 0/0; but since this tutorial code doesn't perform any divisions or similar operations, I see no explanation other than that it comes from internal TF code.

I'm clueless on what to do with this. Any suggestions? The algorithm is converging nicely, its accuracy on my validation set was steadily climbing and just reached 92.5% at iteration 8600.

Solution

Actually, it turned out to be something stupid. I'm posting this in case anyone else runs into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem, since you're not interested in those classes, but as cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
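To see this failure mode in isolation, here is a minimal numpy sketch (numpy follows the same IEEE-754 float semantics as TensorFlow's element-wise ops; the names y_ and y_conv mirror the tutorial's, and the values are made up for illustration):

```python
import numpy as np

# One-hot label, and a prediction where the network has become
# fully certain: probability exactly 0 for the excluded class.
y_ = np.array([0.0, 1.0])      # ground-truth one-hot vector
y_conv = np.array([0.0, 1.0])  # softmax output after convergence

with np.errstate(divide="ignore", invalid="ignore"):
    terms = y_ * np.log(y_conv)  # 0 * log(0) = 0 * (-inf) = nan

cross_entropy = -np.sum(terms)
print(terms)          # first entry is nan
print(cross_entropy)  # nan — one bad term poisons the whole sum
```

The key point is that log(0) is -inf, and 0 * (-inf) is NaN by the floating-point rules, so a single fully-confident wrong-class probability turns the entire loss into NaN.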

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.
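The effect of the clip can be checked with the same numpy stand-in (np.clip playing the role of tf.clip_by_value; values again made up for illustration):

```python
import numpy as np

y_ = np.array([0.0, 1.0])
y_conv = np.array([0.0, 1.0])

# Analogue of tf.clip_by_value(y_conv, 1e-10, 1.0): probabilities
# are pinned away from exact zero before the log is taken, so
# log never produces -inf and the 0 * (-inf) = nan case vanishes.
clipped = np.clip(y_conv, 1e-10, 1.0)
cross_entropy = -np.sum(y_ * np.log(clipped))
print(np.isfinite(cross_entropy))  # True — the loss stays finite
```

As a design note, in later TensorFlow versions the usual way to avoid this entirely is to compute the loss from the raw logits with a fused, numerically stable op such as tf.nn.softmax_cross_entropy_with_logits, rather than clipping a softmax output by hand.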
