Understanding Neural Network Backpropagation


Question

Update: a better formulation of the issue.

I'm trying to understand the backpropagation algorithm with an XOR neural network as an example. For this case there are 2 input neurons + 1 bias, 2 neurons in the hidden layer + 1 bias, and 1 output neuron.

 A   B  A XOR B
 1    1   -1
 1   -1    1
-1    1    1
-1   -1   -1


(diagram of the XOR network; source: wikimedia.org)

I'm using stochastic backpropagation.
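To make the setup concrete, here is a minimal sketch of my own (not part of the original question) of the training set and the 2-2-1 architecture described above, assuming NumPy; the bias is handled by prepending a constant 1 to each layer's input:

    import numpy as np

    # XOR training set with the +/-1 encoding from the table above
    X = np.array([[ 1.0,  1.0],
                  [ 1.0, -1.0],
                  [-1.0,  1.0],
                  [-1.0, -1.0]])
    t = np.array([-1.0, 1.0, 1.0, -1.0])

    rng = np.random.default_rng(0)
    # 2 inputs + 1 bias -> 2 hidden units; 2 hidden units + 1 bias -> 1 output unit
    W_hidden = rng.uniform(-0.5, 0.5, size=(2, 3))   # columns: [bias, A, B]
    W_output = rng.uniform(-0.5, 0.5, size=(1, 3))   # columns: [bias, h1, h2]

    # stochastic backpropagation: the weights are updated after every single example,
    # so one epoch visits the four rows of X one at a time (usually in random order)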

After reading a bit more I found out that the error of the output unit is propagated back to the hidden layer... initially this was confusing, because by the time you reach the input layer of the network, each neuron receives an error adjustment from both of the neurons in the hidden layer. In particular, the way the error gets distributed is hard to grasp at first.

Step 1: calculate the output for each instance of input.

Step 2: calculate the error between the output neuron(s) (in our case there is only one) and the target value(s):

    delta_k = O(k) * (1 - O(k)) * (t(k) - O(k))

Step 3: use the error from Step 2 to calculate the error for each hidden unit h:

    delta_h = O(h) * (1 - O(h)) * sum over k of (weight_kh * delta_k)
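As a sanity check for myself, this is a small sketch (variable names are mine) of what Steps 1 to 3 compute for a single training example with sigmoid units, following the formulas as stated above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, target, W_hidden, W_output, lr=0.1):
        # Step 1: forward pass (a constant 1 is prepended for the bias weight)
        x_b = np.concatenate(([1.0], x))
        o_h = sigmoid(W_hidden @ x_b)              # hidden outputs O(h)
        h_b = np.concatenate(([1.0], o_h))
        o_k = sigmoid(W_output @ h_b)              # output O(k)

        # Step 2: delta of the output unit k
        delta_k = o_k * (1 - o_k) * (target - o_k)

        # Step 3: delta of each hidden unit h
        # delta_h = O(h) * (1 - O(h)) * sum over k of weight_kh * delta_k
        delta_h = o_h * (1 - o_h) * (W_output[:, 1:].T @ delta_k)

        # stochastic update: adjust the weights right after this example
        W_output += lr * np.outer(delta_k, h_b)
        W_hidden += lr * np.outer(delta_h, x_b)
        return W_hidden, W_output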

The 'weight kh' is the weight between hidden unit h and output unit k. This is confusing, because an input unit does not have a direct weight to the output unit. After staring at the formula for a few hours I started to think about what the summation means, and I am coming to the conclusion that each weight connecting an input neuron to a hidden-layer neuron gets multiplied by the output error and summed up. That is a logical conclusion, but the formula still seems a little confusing, since it clearly says 'weight kh' (between output layer k and hidden layer h).

Am I understanding everything correctly here? Can anybody confirm this?

What's O(h) of the input layer? My understanding is that each input node has two outputs: one that goes into the first node of the hidden layer and one that goes into the second node of the hidden layer. Which of the two outputs should be plugged into the O(h)*(1 - O(h)) part of the formula?

Solution

The tutorial you posted here is actually doing it wrong. I double-checked it against Bishop's two standard books and two of my own working implementations. I will point out below exactly where.

An important thing to keep in mind is that you are always looking for derivatives of the error function, either with respect to a unit or with respect to a weight. The former are the deltas; the latter are what you use to update your weights.

If you want to understand backpropagation, you have to understand the chain rule. It's all about the chain rule here. If you don't know exactly how it works, look it up on Wikipedia - it's not that hard. But as soon as you understand the derivations, everything falls into place. Promise! :)

∂E/∂W can be decomposed into ∂E/∂o ∂o/∂W via the chain rule. ∂o/∂W is easily calculated, since it's just the derivative of the activation/output of a unit with respect to the weights. ∂E/∂o is actually what we call the deltas. (I am assuming that E, o and W are vectors/matrices here.)
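To illustrate that decomposition with a toy example of my own (one sigmoid unit, quadratic error): compute the two factors separately and compare their product against a numerical gradient:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([0.3, -0.8])       # weights of a single unit
    x = np.array([1.0, 0.5])        # its inputs
    t = 1.0                         # target

    def E(w):
        o = sigmoid(w @ x)
        return 0.5 * (t - o) ** 2   # quadratic error

    o = sigmoid(w @ x)
    dE_do = -(t - o)                # the delta of this unit
    do_dW = o * (1 - o) * x         # derivative of the output w.r.t. the weights
    grad = dE_do * do_dW            # chain rule: dE/dW = dE/do * do/dW

    # numerical check with central differences
    eps = 1e-6
    numerical = np.array([(E(w + eps * np.eye(2)[i]) - E(w - eps * np.eye(2)[i])) / (2 * eps)
                          for i in range(2)])
    print(np.allclose(grad, numerical))   # True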

We do have the deltas for the output units, since that is where we can calculate the error. (Mostly we have an error function whose delta comes down to (t_k - o_k), e.g. the quadratic error function in the case of linear outputs and cross entropy in the case of logistic outputs.)
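A quick numerical sketch (again my own) of why both pairings end up with the same (t_k - o_k) form for the output delta, taken with respect to the unit's net input:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z, t, eps = 0.7, 1.0, 1e-6

    # quadratic error with a linear output: E = 0.5 * (t - z)^2, o = z
    quad = lambda z: 0.5 * (t - z) ** 2
    print((quad(z + eps) - quad(z - eps)) / (2 * eps), -(t - z))

    # cross entropy with a logistic output: E = -(t*log(o) + (1-t)*log(1-o)), o = sigmoid(z)
    xent = lambda z: -(t * np.log(sigmoid(z)) + (1 - t) * np.log(1 - sigmoid(z)))
    print((xent(z + eps) - xent(z - eps)) / (2 * eps), -(t - sigmoid(z)))

    # in both cases the numerical derivative matches -(t - o), i.e. the delta is (t - o) up to sign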

The question now is: how do we get the derivatives for the internal units? Well, we know that the output of a unit is the weighted sum of all its incoming units, with a transfer function applied afterwards. So o_k = f(sum(w_kj * o_j, for all j)).

So what we do is differentiate o_k with respect to o_j, because delta_j = ∂E/∂o_j = ∂E/∂o_k ∂o_k/∂o_j = delta_k ∂o_k/∂o_j. So given delta_k, we can calculate delta_j!

Let's do this. o_k = f(sum(w_kj * o_j, for all j)) => ∂o_k/∂o_j = f'(sum(w_kj * o_j, for all j)) * w_kj = f'(z_k) * w_kj.
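Putting that together, a small vectorized sketch (my own notation) of computing the deltas of layer j from the deltas of the following layer k:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))   # f'(z_k), evaluated at the net input

    def backward_deltas(delta_k, W_kj, z_k):
        # delta_j = sum over k of delta_k * f'(z_k) * w_kj
        return W_kj.T @ (f_prime(z_k) * delta_k)

    # example: 3 units in layer k fed by 2 units in layer j
    rng = np.random.default_rng(1)
    W_kj = rng.normal(size=(3, 2))     # w_kj: weight from unit j to unit k
    o_j = rng.normal(size=2)           # outputs of layer j
    z_k = W_kj @ o_j                   # net inputs of layer k
    delta_k = rng.normal(size=3)       # deltas of layer k, assumed already known

    delta_j = backward_deltas(delta_k, W_kj, z_k)
    print(delta_j)                     # the deltas of layer j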

For the case of the sigmoidal transfer function, this becomes z_k(1 - z_k) * w_kj. (Here is the error in the tutorial, the author says o_k(1 - o_k) * w_kj!)
