NaN loss when training regression network

Problem description

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD, RMSprop
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))

sgd = SGD(lr=0.01, nesterov=True)
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
          validation_data=(X_test, Y_test),
          callbacks=[EarlyStopping(monitor='val_loss', patience=4)])

However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:

Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
 88448/260000 [=========>....................] - ETA: 161s - loss: nan

I tried using RMSProp instead of SGD, I tried tanh instead of relu, I tried with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. There seems to be some kind of overflow, but I can't imagine why--the loss is not unreasonably large at all.

Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano and I also get NaNs, so I tried going back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, and so does the 0.3.2 version.

Recommended answer

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
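
As a minimal sketch, swapping in Adam might look like the following (assuming a Keras version that provides keras.optimizers.Adam; the default hyperparameters are usually a reasonable starting point):

from keras.optimizers import Adam

# Adam adapts the learning rate per parameter, which is typically far more
# forgiving than a hand-tuned SGD schedule for an unbounded regression target.
model.compile(loss='mean_absolute_error', optimizer=Adam())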

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalizing or z scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) See the z-scoring sketch after this list.

  2. Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may also (a revised-model sketch follows this list).

  3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35), so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
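
Regarding point 1, here is a minimal z-scoring sketch that computes the scaling statistics on the training targets only (Y_train and Y_test are the arrays from the question; everything else is illustrative):

import numpy as np

# Fit the scaling on the training targets only, then apply it to both splits.
y_mean = np.mean(Y_train)
y_std = np.std(Y_train)

Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std

# Train against the scaled targets, then undo the scaling on predictions, e.g.
# predictions = model.predict(X_test).ravel() * y_std + y_mean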
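
Points 2, 3 and 4 can be folded into a revised version of the model from the question. This is only a sketch under assumed settings: the penalty strengths, layer sizes and dropout rate are placeholders to illustrate the idea, it reuses the scaled targets from the sketch above, and it is written against the Keras 2 API (older Keras versions spell some of these arguments differently, e.g. nb_epoch instead of epochs):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adam
from keras import regularizers

model = Sequential()
# Smaller first layer (point 3) with an L1/L2 weight penalty (point 2).
model.add(Dense(256, input_shape=(n_train,),
                kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Activation('relu'))
model.add(Dropout(0.3))  # heavier dropout is another form of regularization

model.add(Dense(128, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Activation('relu'))
model.add(Dropout(0.3))

model.add(Dense(1))

model.compile(loss='mean_absolute_error', optimizer=Adam())
# Larger batches (point 4) give a steadier gradient estimate.
model.fit(X_train, Y_train_scaled, batch_size=128, epochs=3, verbose=1,
          validation_data=(X_test, Y_test_scaled))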
