Multi-layer neural network back-propagation formula (using stochastic gradient descent)
Problem description
Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
def sigmoid_prime(z):
    # because σ'(x) = σ(x)(1 - σ(x)); here z is already an activation σ(x)
    return z * (1 - z)

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
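(For context, here is a minimal, hypothetical harness in which this train method could run; the sigmoid helper, the layer sizes, the weight initialization, and the learning rate below are assumptions, not part of the original code.)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Network:
    def __init__(self, sizes=(784, 30, 20, 10), learning_rate=0.1):
        # hypothetical setup: one weight matrix per layer transition, shape (n_out, n_in)
        rng = np.random.default_rng(0)
        self.weights = [rng.normal(0.0, 0.1, (sizes[k + 1], sizes[k]))
                        for k in range(len(sizes) - 1)]
        self.learning_rate = learning_rate

Network.train = train  # attach the train method defined above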
It works, but:

- the accuracy at the end (for my use case: MNIST digit recognition) is just OK, but not very good. It is much better (i.e. the convergence is much better) when line (1) is replaced by:
delta = np.dot(self.weights[k].T, delta) # (2)
- the code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests:
delta = np.dot(self.weights[k].T, delta)
instead of:
delta = np.dot(self.weights[k].T, tmp)
(With the notations of this article, it is:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
)
These two arguments seem to agree: code (2) is better than code (1).
However, the math seems to show the contrary (see the video here; one more detail: note that my loss function is multiplied by 1/2, whereas it is not in the video):
Question: which one is correct, implementation (1) or (2)?

In LaTeX:
$$C = \frac{1}{2} (a^L - y)^2$$
$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
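In matrix form (the formulas above are written as if for scalars), these chain into the recursion that the code variables implement, with delta storing ∂C/∂a and tmp storing ∂C/∂z:

$$\frac{\partial C}{\partial z^{l}} = \frac{\partial C}{\partial a^{l}} \odot \sigma'(z^{l}), \qquad \frac{\partial C}{\partial a^{l-1}} = (w^{l})^T \, \frac{\partial C}{\partial z^{l}}, \qquad \frac{\partial C}{\partial w^{l}} = \frac{\partial C}{\partial z^{l}} \, \left(a^{l-1}\right)^T$$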
Accepted answer
I spent two days analyzing this problem and filled a few pages of a notebook with partial derivative computations... and I can confirm:

- the math written in LaTeX in the question is correct
- code (1) is the correct one, and it agrees with the math computations:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
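One way to verify this independently (a sketch, not part of the original answer; the small network and random data below are assumptions) is a finite-difference gradient check: perturb a single weight, recompute the loss C = 1/2 (a - y)^2, and compare with the analytic gradient np.dot(tmp, A[k].T):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(weights, x, y):
    # forward pass matching the snippet: 3 layers, no biases, C = 1/2 * sum((a - y)^2)
    a = x
    for W in weights:
        a = sigmoid(np.dot(W, a))
    return 0.5 * np.sum((a - y) ** 2)

def analytic_grads(weights, x, y):
    # back-propagation exactly as in code (1)
    a = x
    A = [a]
    for W in weights:
        a = sigmoid(np.dot(W, a))
        A.append(a)
    grads = [None, None, None]
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * A[k + 1] * (1.0 - A[k + 1])  # sigmoid_prime on stored activations
        delta = np.dot(weights[k].T, tmp)          # line (1)
        grads[k] = np.dot(tmp, A[k].T)
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]  # arbitrary small layer sizes, chosen just for this test
weights = [rng.normal(0.0, 1.0, (sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(0.0, 1.0, (4, 1))
y = rng.normal(0.0, 1.0, (3, 1))

grads = analytic_grads(weights, x, y)
k, i, j, eps = 1, 2, 3, 1e-6  # probe one arbitrary weight
W_plus = [W.copy() for W in weights]
W_plus[k][i, j] += eps
W_minus = [W.copy() for W in weights]
W_minus[k][i, j] -= eps
numeric = (loss(W_plus, x, y) - loss(W_minus, x, y)) / (2.0 * eps)
print(numeric, grads[k][i, j])  # the two values should agree closely

With line (2) substituted into analytic_grads, the numeric and analytic values diverge for the lower layers, which is exactly the claim being confirmed here.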
Code (2) is wrong:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
and in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set, the line:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
should be:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
Now the difficult part that took me days to realize:

- Apparently, code (2) has a far better convergence than code (1); that's why I was misled into thinking code (2) was correct and code (1) was wrong.
... But in fact that is just a coincidence, because the learning_rate was set too low. Here is the reason: when using code (2), the parameter delta grows much faster (print(np.linalg.norm(delta)) helps to see this) than with code (1).
Thus the "incorrect code (2)" just compensated for the "too slow learning rate" by producing a bigger delta, and in some cases this led to an apparently faster convergence.
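A quick way to see this (a diagnostic sketch using the print mentioned above, not code from the original post) is to probe the norm of delta inside the back-propagation loop:

# back-propagation loop with a norm probe on delta
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k + 1])
    delta = np.dot(self.weights[k].T, tmp)      # variant (1): correct
    # delta = np.dot(self.weights[k].T, delta)  # variant (2): norm of delta grows much faster
    print(k, np.linalg.norm(delta))             # watch how ||delta|| evolves per layer
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)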
Now it's solved!