如何使用debug_info解释caffe日志? [英] How to interpret caffe log with debug_info?

查看:127
本文介绍了如何使用debug_info解释caffe日志?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在培训过程中遇到困难时( nan s 'solver.prototxt'文件中的debug_info: true .

培训日志如下:

I1109 ...]     [Forward] Layer data, top blob data data: 0.343971    
I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
. 
.
.
I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
.
.
.
I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

这是什么意思?

解决方案

乍一看,您可以看到此日志部分分为两个部分:[Forward][Backward].回想一下,神经网络训练是通过向前-向后传播完成的:
训练示例(批量)被馈送到网络,并且正向传递输出当前的预测.
基于该预测,计算损失. 然后可以得出损耗,并使用链规则估算梯度并向后传播. /p>

Caffe Blob 数据结构
只需快速回顾一下. Caffe使用Blob数据结构来存储数据/权重/参数等.对于此讨论,必须注意Blob具有两个部分":datadiff. Blob的值存储在data部分中. diff部分用于在反向传播步骤中存储逐元素梯度.

前进通行证

您将在日志的此部分中看到从下到上列出的所有层.对于每一层,您都会看到:

I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1"是一个有2个参数Blob的卷积层:滤镜和偏置.因此,日志有三行.过滤器Blob(param blob 0)具有data

 I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

即卷积滤波器权重的当前L2范数为0.00899.
电流偏置(param blob 1):

 I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

表示当前偏差设置为0.

最后但并非最不重要的是,"conv1"层具有名为"conv1"的输出"top"(原始程度如何...).输出的L2范数是

 I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

请注意,[Forward]遍的所有L2值都在所讨论的Blob的data部分上报告.

损耗和梯度
[Forward]传递的末尾是损失层:

I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

在此示例中,批次损失为2031.85,即损失w.r.t.的梯度.计算fc1并将其传递给fc1 Blob的 diff 部分.梯度的L2大小为0.1245.

后退通行证
该部分的其余所有层从上到下列出.您可以看到,现在报告的L2幅度属于Blob的diff部分(参数和图层的输入).

最后
此迭代的最后一个日志行:

[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

报告数据和梯度的总L1和L2量级.

我应该寻找什么?

  1. 如果您有 nan ,看看您的数据或差异在什么时候变成nan:在哪一层?哪个迭代?

  2. 看看梯度幅度,它们应该是合理的.如果您开始看到e+8的值,则数据/梯度开始爆炸.降低学习率!

  3. 请确保diff不为零.零差异意味着没有梯度=没有更新=没有学习.如果您从随机权重开始,请考虑生成方差更高的随机权重.

  4. 查找激活次数(而不是渐变次数)为零.如果您使用的是 "ReLU" ,这意味着您的输入/权重会将您引向以下区域ReLU门未激活"导致死了神经元".考虑将输入标准化为零均值,在ReLU中添加["BatchNorm"][6] layers, setting negative_slope`.

When facing difficulties during training (nans, loss does not converge, etc.) it is sometimes useful to look at more verbose training log by setting debug_info: true in the 'solver.prototxt' file.

The training log then looks something like:

I1109 ...]     [Forward] Layer data, top blob data data: 0.343971    
I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
. 
.
.
I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
.
.
.
I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

What does it mean?

解决方案

At first glance you can see this log section divided into two: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:
A training example (batch) is fed to the net and a forward pass outputs the current prediction.
Based on this prediction a loss is computed. The loss is then derived, and a gradient is estimated and propagated backward using the chain rule.

Caffe Blob data structure
Just a quick re-cap. Caffe uses Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.

Forward pass

You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:

I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

 I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

That is the current L2 norm of the convolution filter weights is 0.00899.
The current bias (param blob 1):

 I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

meaning that currently the bias is set to 0.

Last but not least, "conv1" layer has an output, "top" named "conv1" (how original...). The L2 norm of the output is

 I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

Note that all L2 values for the [Forward] pass are reported on the data part of the Blobs in question.

Loss and gradient
At the end of the [Forward] pass comes the loss layer:

I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85, the gradient of the loss w.r.t. fc1 is computed and passed to diff part of fc1 Blob. The L2 magnitude of the gradient is 0.1245.

Backward pass
All the rest of the layers are listed in this part top to bottom. You can see that the L2 magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).

Finally
The last log line of this iteration:

[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

reports the total L1 and L2 magnitudes of both data and gradients.

What should I look for?

  1. If you have nans in your loss, see at what point your data or diff turns into nan: at which layer? at which iteration?

  2. Look at the gradient magnitude, they should be reasonable. IF you are starting to see values with e+8 your data/gradients are starting to blow up. Decrease your learning rate!

  3. See that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance.

  4. Look for activations (rather than gradients) going to zero. If you are using "ReLU" this means your inputs/weights lead you to regions where the ReLU gates are "not active" leading to "dead neurons". Consider normalizing your inputs to have zero mean, add ["BatchNorm"][6] layers, settingnegative_slope` in ReLU.

这篇关于如何使用debug_info解释caffe日志?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆