Deep-Learning Nan loss reasons

Problem Description

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifics:

I am using Tensorflow's iris_training model with some of my own data and keep getting

ERROR:tensorflow:Model diverged with loss = NaN.

Traceback...

tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.

The traceback originated at the line:

    tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                   hidden_units=[300, 300, 300],
                                   #optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),
                                   n_classes=11,
                                   model_dir="/tmp/iris_model")

I've tried adjusting the optimizer, using a zero for learning rate, and using no optimizer. Any insights into network layers, data size, etc. are appreciated.

Recommended Answer

There are lots of things I have seen make a model diverge.

1. Too high of a learning rate. You can often tell this is the case if the loss begins to increase and then diverges to infinity. (A sketch of passing a smaller, explicit learning rate follows this list.)

2. I am not too familiar with the DNNClassifier, but I am guessing it uses the categorical cross-entropy cost function. This involves taking the log of the prediction, which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence (see the epsilon sketch after this list). I am guessing the DNNClassifier probably does this or uses the TensorFlow op for it. Probably not the issue.

3. Other numerical stability issues can exist, such as division by zero, where adding an epsilon can help. Another, less obvious one is the square root, whose derivative can diverge if it is not properly simplified when dealing with finite-precision numbers. Yet again, I doubt this is the issue in the case of the DNNClassifier. (The same epsilon sketch below applies here.)

4. You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are not introducing the NaN. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized: you probably want the pixels in the range [-1, 1] and not [0, 255]. (A data-check sketch follows this list.)

5. The labels must be in the domain of the loss function, so if you are using a logarithm-based loss function, all labels must be non-negative (as noted by evan pu and the comments below). A quick label-range check is sketched below.
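
For point 1, here is a minimal sketch of setting a smaller, explicit learning rate by passing an optimizer to DNNClassifier, mirroring the commented-out line in the question. The feature column definition is an assumption borrowed from the iris_training tutorial (four real-valued features); adjust it to your own data.

    import tensorflow as tf

    # Assumption: four real-valued features, as in the iris_training tutorial.
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

    # Explicit optimizer with a small learning rate instead of the default.
    optimizer = tf.train.ProximalAdagradOptimizer(learning_rate=0.001)

    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=[300, 300, 300],
                                                optimizer=optimizer,
                                                n_classes=11,
                                                model_dir="/tmp/iris_model")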
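
For points 2 and 3, a small illustration (not the DNNClassifier internals, which I have not inspected) of how an epsilon guards the log in a hand-rolled cross-entropy, and how the same trick guards division and square roots. The placeholder shapes are made up for the example.

    import tensorflow as tf

    eps = 1e-8

    # Hand-rolled cross-entropy: log(p) and its gradient diverge as p -> 0,
    # so a small epsilon keeps the argument of the log strictly positive.
    labels = tf.placeholder(tf.float32, shape=[None, 11])        # one-hot targets (illustrative)
    predictions = tf.placeholder(tf.float32, shape=[None, 11])   # softmax outputs (illustrative)
    cross_entropy = -tf.reduce_sum(labels * tf.log(predictions + eps), axis=1)

    # The same guard helps with other numerically unstable operations:
    x = tf.placeholder(tf.float32, shape=[None])
    safe_ratio = x / (tf.reduce_sum(x) + eps)   # division by a possibly-zero sum
    safe_root = tf.sqrt(x + eps)                # d/dx sqrt(x) = 1/(2*sqrt(x)) blows up at x = 0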
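
For point 4, the suggested checks written out; the arrays here are illustrative stand-ins for your own features and targets.

    import numpy as np

    # Illustrative stand-ins for your own feature / target arrays.
    x = np.array([[0.0, 255.0], [128.0, 64.0]], dtype=np.float32)
    y = np.array([0, 3], dtype=np.int64)

    assert not np.any(np.isnan(x)), "NaN in the input features"
    assert not np.any(np.isnan(y)), "NaN in the targets"

    # If the features are raw pixel values in [0, 255], rescale them to [-1, 1]:
    x = x / 127.5 - 1.0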
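
For point 5, a quick sanity check that the labels are valid class ids for a log-based classification loss; n_classes = 11 is taken from the question's DNNClassifier configuration, and the label array is illustrative.

    import numpy as np

    n_classes = 11                              # from the question's configuration
    y = np.array([0, 3, 10], dtype=np.int64)    # illustrative labels

    # Class ids must be non-negative and smaller than n_classes.
    assert np.all((y >= 0) & (y < n_classes)), "labels outside the loss function's domain"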
