Understanding Neural Network Backpropagation


Problem Description

Update: a better formulation of the issue.

I'm trying to understand the backpropagation algorithm with an XOR neural network as an example. For this case there are 2 input neurons + 1 bias, 2 neurons in the hidden layer + 1 bias, and 1 output neuron.

 A    B    A XOR B
 1    1      -1
 1   -1       1
-1    1       1
-1   -1      -1


(Diagram of the XOR network omitted; source: wikimedia.org)

I'm using stochastic backpropagation.
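
To make the setup concrete, here is a minimal sketch of the forward pass for this 2-2-1 network in Python. All names, the random weights, and the sigmoid activation are illustrative assumptions, not taken from the tutorial (with the ±1 targets above, a tanh output unit would arguably fit better):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W_hidden, W_output, f=sigmoid):
        """Forward pass: 2 inputs + bias -> 2 hidden units + bias -> 1 output."""
        x_b = np.append(x, 1.0)                # append the bias input
        o_hidden = f(W_hidden @ x_b)           # activations of the 2 hidden units
        o_hidden_b = np.append(o_hidden, 1.0)  # append the hidden-layer bias unit
        return f(W_output @ o_hidden_b)        # activation of the single output unit

    # Illustrative random weights: 2x3 for the hidden layer, 1x3 for the output layer.
    rng = np.random.default_rng(0)
    W_hidden = rng.normal(scale=0.5, size=(2, 3))
    W_output = rng.normal(scale=0.5, size=(1, 3))
    print(forward(np.array([1.0, -1.0]), W_hidden, W_output))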

After reading a bit more I have found out that the error of the output unit is propagated back to the hidden layers... Initially this was confusing, because when you get to the input layer of the neural network, each neuron gets an error adjustment from both of the neurons in the hidden layer. In particular, the way the error is distributed is difficult to grasp at first.

Step 1: calculate the output for each instance of input.
Step 2: calculate the error between the output neuron(s) (in our case there is only one) and the target value(s).
Step 3: use the error from Step 2 to calculate the error for each hidden unit h. (The formulas for Steps 2 and 3 are sketched just below.)
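
The tutorial's formula images did not survive here. Judging from the terms discussed further down (the summation over output units k, the weight w_kh, and the O(h)*(1 - O(h)) factor), Steps 2 and 3 were presumably the standard deltas for sigmoid units:

    \delta_k = O_k\,(1 - O_k)\,(t_k - O_k)            \quad\text{(output unit } k \text{, target } t_k\text{)}

    \delta_h = O_h\,(1 - O_h)\,\sum_{k} w_{kh}\,\delta_k \quad\text{(hidden unit } h\text{)}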

The 'weight kh' is the weight between hidden unit h and output unit k. This is confusing, because an input unit has no direct weight associated with the output unit. After staring at the formula for a few hours I started to think about what the summation means, and I'm coming to the conclusion that each weight connecting an input neuron to a hidden-layer neuron gets multiplied by the output error and summed up. That is a logical conclusion, but the formula still seems a little confusing, since it clearly says 'weight kh' (between the output layer k and the hidden layer h).

Am I understanding everything correctly here? Can anybody confirm this?

What's O(h) of the input layer? My understanding is that each input node has two outputs: one that goes into the first node of the hidden layer and one that goes into the second node of the hidden layer. Which of the two outputs should be plugged into the O(h)*(1 - O(h)) part of the formula?

Solution

The tutorial you posted here is actually doing it wrong. I double checked it against Bishop's two standard books and two of my working implementations. I will point out below where exactly.

An important thing to keep in mind is that you are always searching for derivatives of the error function with respect to either a unit or a weight. The former are the deltas; the latter are what you use to update your weights.

If you want to understand backpropagation, you have to understand the chain rule. It's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia - it's not that hard. But as soon as you understand the derivations, everything falls into place. Promise! :)

∂E/∂W can be decomposed into ∂E/∂o ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is actually what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)
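
As a concrete instance of that decomposition, here is a sketch for a single weight w_kj into output unit k, using the answer's convention delta_k = ∂E/∂o_k:

    \frac{\partial E}{\partial w_{kj}}
      = \frac{\partial E}{\partial o_k}\,\frac{\partial o_k}{\partial w_{kj}}
      = \delta_k \, f'(z_k)\, o_j ,
    \qquad\text{where } z_k = \sum_j w_{kj}\, o_j .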

We do have them for the output units, since that is where we can calculate the error. (Mostly we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs and cross entropy in the case of logistic outputs.)
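
For instance, for the quadratic error function, the output delta works out to (up to the sign convention used):

    E = \tfrac{1}{2}\sum_k (t_k - o_k)^2
    \quad\Rightarrow\quad
    \delta_k = \frac{\partial E}{\partial o_k} = -(t_k - o_k).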

The question now is: how do we get the derivatives for the internal units? Well, we know that the output of a unit is the weighted sum of all incoming units, with a transfer function applied afterwards. So o_k = f(sum(w_kj * o_j, for all j)).

So what we do is differentiate o_k with respect to o_j: delta_j = ∂E/∂o_j = sum(∂E/∂o_k * ∂o_k/∂o_j, for all output units k) = sum(delta_k * ∂o_k/∂o_j, for all k). In our XOR example there is only one output unit, so the sum has a single term. So given delta_k, we can calculate delta_j!

Let's do this. o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj, writing z_k for the weighted sum.
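
A minimal Python sketch of this backward step, computing the hidden deltas from the output deltas; the names and the generic fprime argument are assumptions for illustration, not part of the answer:

    import numpy as np

    def hidden_deltas(delta_out, W_out, z_out, fprime):
        """delta_j = sum over k of delta_k * f'(z_k) * w_kj.

        delta_out : deltas of the output units, shape (K,)
        W_out     : weights w_kj from hidden unit j to output unit k, shape (K, J)
        z_out     : weighted sums z_k of the output units, shape (K,)
        fprime    : derivative of the transfer function, evaluated at z_k
        """
        return (delta_out * fprime(z_out)) @ W_out   # shape (J,)

    # Example: one output unit (K = 1), two hidden units (J = 2), tanh transfer function.
    tanh_prime = lambda z: 1.0 - np.tanh(z) ** 2
    print(hidden_deltas(np.array([0.3]), np.array([[0.5, -0.2]]), np.array([0.1]), tanh_prime))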

For the case of the sigmoidal transfer function, this becomes z_k(1 - z_k) * w_kj. (Here is the error in the tutorial, the author says o_k(1 - o_k) * w_kj!)
