Common causes of nans during training


Problem description

I've noticed that a frequent occurrence during training is nans being introduced.

Often it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.

Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?

The overarching question here is simply: what is the most common reason for nans to occur during training? And secondly, what are some methods for combating this (and why do they work)?

Solution


Good question.
I came across this phenomenon several times. Here are my observations:


Gradient blow up

Reason: large gradients throw the learning process off-track.

What you should expect: Looking at the runtime log, you should watch the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and it will become nan.

What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
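
As a rough sketch (the layer name and the numbers below are placeholders, not values from the question), the two knobs mentioned above would look like this:

# solver.prototxt: reduce base_lr by an order of magnitude
base_lr: 0.001              # e.g. if it was 0.01 before

# train_val.prototxt: down-weight only the loss layer that blows up
layer {
  name: "aux_loss"          # hypothetical loss layer identified from the log
  type: "SoftmaxWithLoss"
  bottom: "aux_score"
  bottom: "label"
  top: "aux_loss"
  loss_weight: 0.1          # default is 1; lower it for this layer only
}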


Bad learning rate policy and params

Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates, thus invalidating all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rates in caffe, see this thread.
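
For example, a minimal "poly" configuration in solver.prototxt (the numbers are placeholders) might look like:

# solver.prototxt
base_lr: 0.01
lr_policy: "poly"
power: 0.9
max_iter: 100000   # must be set: "poly" computes lr = base_lr * (1 - iter/max_iter)^power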


Faulty Loss function

Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add printouts to the loss layer, and debug the error.

For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
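
For the InfogainLoss case specifically, one way to make sure the layer sees normalized values is to place a "Softmax" layer in front of it. A minimal sketch (the blob names and the H-matrix file are hypothetical):

layer {
  name: "prob"
  type: "Softmax"           # normalizes raw scores into probabilities
  bottom: "score"           # hypothetical prediction blob
  top: "prob"
}
layer {
  name: "loss"
  type: "InfogainLoss"
  bottom: "prob"            # normalized input, as the layer expects
  bottom: "label"
  top: "loss"
  infogain_loss_param {
    source: "infogain_H.binaryproto"   # hypothetical infogain matrix file
  }
}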


Faulty input

Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
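
A minimal sketch of such a debugging net (the lmdb path is a placeholder) is a data layer followed by any cheap layer marked as a loss via loss_weight:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "train_lmdb"    # hypothetical path to the dataset under test
    backend: LMDB
    batch_size: 1           # one sample per iteration makes the faulty input easy to locate
  }
}
layer {
  name: "dummy_loss"
  type: "Reduction"
  bottom: "data"
  top: "dummy_loss"
  reduction_param { operation: ASUM }  # sum of absolute values of the input blob
  loss_weight: 1            # report this value as a loss in the log
}

Running this net forward over the whole set and watching the logged loss tells you at which iteration (i.e. which sample) the nan appears.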


Stride larger than kernel size in "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.
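
Keeping the stride no larger than the kernel size avoids the problem, e.g. the same layer with a smaller stride:

layer {
  name: "fixed_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 3        # stride <= kernel_size
    kernel_size: 3
  }
}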


Instabilities in "BatchNorm"

It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.


Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training. This information can help in spotting gradient blow-ups and other problems in the training process.
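
Concretely, that is a single line in solver.prototxt:

# solver.prototxt
debug_info: true   # log per-layer activation and gradient magnitudes every iteration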
