Deep-Learning Nan loss reasons

Problem description

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifically:

I am using TensorFlow's iris_training model with some of my own data and keep getting

ERROR:tensorflow:Model diverged with loss = NaN.

Traceback...

tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.

The traceback originated with the following line:

 tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=[300, 300, 300],
                                        #optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),                                                          
                                        n_classes=11,
                                        model_dir="/tmp/iris_model")

I've tried adjusting the optimizer, using a zero for the learning rate, and using no optimizer. Any insights into network layers, data size, etc. are appreciated.

Answer

There are lots of things I have seen make a model diverge; minimal code sketches illustrating the points below follow after the list.

  1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

  2. I am not too familiar with the DNNClassifier, but I am guessing it uses the categorical cross-entropy cost function. This involves taking the log of the prediction, which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence. I am guessing the DNNClassifier probably does this or uses the TensorFlow op for it. Probably not the issue.

  3. Other numerical stability issues can exist, such as division by zero, where adding the epsilon helps. Another, less obvious one is the square root, whose derivative can diverge if it is not properly simplified when dealing with finite-precision numbers. Yet again, I doubt this is the issue in the case of the DNNClassifier.

  4. You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are not introducing NaNs. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].

  5. The labels must be in the domain of the loss function, so if you are using a logarithmic-based loss function, all labels must be non-negative (as noted by Evan Pu and the comments below).
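
For point 1, a minimal sketch of what lowering the learning rate could look like, reusing the tf.contrib.learn estimator API shown in the question. The learning_rate value and the dimension=4 feature column are only illustrative guesses, not values taken from the asker's data:

    import tensorflow as tf

    # One numeric feature column of width 4, as in the TF iris tutorial;
    # adjust the dimension to match your own data.
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

    # Same estimator as in the question, but with an explicit optimizer and a
    # deliberately small learning rate so the loss has less room to blow up.
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[300, 300, 300],
        optimizer=tf.train.AdagradOptimizer(learning_rate=0.001),
        n_classes=11,
        model_dir="/tmp/iris_model")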
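
For points 2 and 3, a small NumPy sketch (not the DNNClassifier internals, just an illustration with a hand-rolled cross-entropy) showing how a log-based loss turns a saturated prediction into NaN, and how a small epsilon keeps it finite:

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-7):
        # Clipping keeps predictions away from 0 and 1, so log() stays finite.
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        return -np.sum(y_true * np.log(y_pred), axis=-1)

    y_true = np.array([[0.0, 1.0]])
    y_pred = np.array([[0.0, 1.0]])   # a fully saturated prediction

    with np.errstate(divide="ignore", invalid="ignore"):
        naive = -np.sum(y_true * np.log(y_pred), axis=-1)

    print(naive)                           # [nan] -- 0 * log(0) poisons the loss
    print(cross_entropy(y_true, y_pred))   # a tiny finite value instead

    # The same epsilon trick helps with other unstable ops, e.g. x / (y + eps)
    # or np.sqrt(x + eps), whose gradients blow up near zero.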
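
For point 4, a sketch of the suggested input checks and normalization; the random integer array is only a stand-in for your real feature matrix:

    import numpy as np

    # Stand-in for real input features, e.g. 8-bit pixel data in [0, 255].
    x = np.random.randint(0, 256, size=(100, 4)).astype(np.float32)

    # Fail fast if the data already contains NaN or Inf.
    assert not np.any(np.isnan(x)), "NaN found in the input data"
    assert not np.any(np.isinf(x)), "Inf found in the input data"

    # Rescale [0, 255] pixel values into [-1, 1].
    x = x / 127.5 - 1.0
    print(x.min(), x.max())   # now within [-1, 1]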
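
For point 5, a sketch of checking that the labels lie in the domain the loss expects; n_classes=11 comes from the question, and the labels array is a stand-in for the real targets:

    import numpy as np

    n_classes = 11                                        # as in the question's DNNClassifier
    labels = np.random.randint(0, n_classes, size=100)    # stand-in for the real targets

    # A log-based classification loss expects integer class ids in [0, n_classes);
    # negative or out-of-range labels push the loss outside its domain.
    assert np.issubdtype(labels.dtype, np.integer), "labels must be integer class ids"
    assert labels.min() >= 0 and labels.max() < n_classes, "labels outside [0, n_classes)"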
