NaN loss when training regression network

Problem description

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(1))

sgd = SGD(lr=0.01, nesterov=True)
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
          validation_data=(X_test, Y_test),
          callbacks=[EarlyStopping(monitor='val_loss', patience=4)])

However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:

Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
 88448/260000 [=========>....................] - ETA: 161s - loss: nan

I tried using RMSProp instead of SGD, I tried tanh instead of relu, I tried with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and hit the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. There seems to be some kind of overflow, but I can't imagine why: the loss is not unreasonably large at all.

Python version 2.7.11, running on a Linux machine, CPU only. I tested with the latest version of Theano and also get NaNs, so I tried going back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, as does version 0.3.2.

Answer

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).
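This failure mode is easy to reproduce outside Keras. The toy loop below is a hypothetical 1-D least-squares problem (not the asker's model): the learning rate is too large for the curvature, so each SGD step overshoots by a growing amount until float32 overflows to inf and the next update produces nan, just like the training log above.

```python
import numpy as np

# Minimize (w*x - y)^2 by SGD with a learning rate that is too large:
# the update factor (1 - 2*lr*x^2) has magnitude > 1, so |w| grows
# every step until float32 overflows.
x = np.float32(10.0)
y = np.float32(5.0)
w = np.float32(0.0)
lr = np.float32(0.1)

with np.errstate(over='ignore', invalid='ignore'):
    for _ in range(200):
        grad = np.float32(2.0) * (w * x - y) * x  # d/dw of the squared error
        w = w - lr * grad

# w diverged to inf, then inf - inf in the update made it NaN
print(w)
```

The loss never has to be "unreasonably large" in the printed log: one overshooting step is enough to start the doubling, and the nan appears only a few dozen updates later.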

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
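The per-parameter scaling is the point of that recommendation. A minimal numpy sketch of a single Adam update (the standard update equations, with illustrative values) shows that parameters with wildly different gradient magnitudes still take steps of roughly the same size, bounded by the learning rate:

```python
import numpy as np

# Standard Adam hyperparameters
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

w = np.array([1.0, -1.0])
m = np.zeros_like(w)   # first-moment (mean) estimate
v = np.zeros_like(w)   # second-moment (uncentered variance) estimate
t = 1

grad = np.array([100.0, 0.01])  # gradient scales differing by 10^4

m = b1 * m + (1 - b1) * grad
v = b2 * v + (1 - b2) * grad ** 2
m_hat = m / (1 - b1 ** t)       # bias correction
v_hat = v / (1 - b2 ** t)
w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Both parameters moved by ~lr = 0.001 despite the 10^4 gradient ratio
print(w)
```

Because each step is effectively `lr * sign(gradient)` at the extremes, a single huge gradient cannot blow the weights up the way it can with plain SGD.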

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalization or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.)
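A minimal numpy sketch of the z-scoring variant (the target values are hypothetical; the key point is that the mean and standard deviation come from the training split only, and are then reused on the test split):

```python
import numpy as np

# Hypothetical regression targets
y_train = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y_test = np.array([9.0, 2.0, 6.0])

# Fit the normalization on the training split only
mu, sigma = y_train.mean(), y_train.std()

y_train_z = (y_train - mu) / sigma
y_test_z = (y_test - mu) / sigma  # reuse training stats; never refit on test

print(y_train_z.mean(), y_train_z.std())  # 0 mean, unit std on train
```

Predictions from the network are then mapped back to the original scale with `y = z * sigma + mu`.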

  2. Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help too.
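As a sketch of what the weight penalties do to the objective (the weights and penalty strengths below are hypothetical; in Keras this is what the layer's weight-regularizer argument adds to the loss for you):

```python
import numpy as np

# Hypothetical layer weights and penalty strengths
w = np.array([0.5, -2.0, 0.0, 3.0])
data_loss = 1.25          # stand-in for the mean absolute error term
l1, l2 = 0.01, 0.001

# L1 penalizes |w| (pushes weights to exactly zero, feature-selection-like);
# L2 penalizes w^2 (shrinks large weights, which also tames large activations)
penalty = l1 * np.abs(w).sum() + l2 * np.square(w).sum()
total_loss = data_loss + penalty  # what the optimizer actually minimizes

print(total_loss)
```

Note that the zero weight contributes nothing to the L1 term, which is why L1 behaves like a soft feature selector.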

  3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35), so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could increase the stability of the optimization.
