Multi-layer neural network back-propagation formula (using stochastic gradient descent)
Problem description
Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:
def sigmoid_prime(z):
    # because σ'(x) = σ(x)(1 - σ(x)); here z is already an activation σ(x)
    return z * (1 - z)

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)
    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
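(For context, here is a minimal, hypothetical harness in which this train method could run; the sigmoid helper, the layer sizes, the weight initialization, and the learning rate below are assumptions, not part of the original code.)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Network:
    def __init__(self, sizes=(784, 30, 20, 10), learning_rate=0.1):
        # hypothetical setup: one weight matrix per layer transition, shape (n_out, n_in)
        rng = np.random.default_rng(0)
        self.weights = [rng.normal(0.0, 0.1, (sizes[k + 1], sizes[k]))
                        for k in range(len(sizes) - 1)]
        self.learning_rate = learning_rate

Network.train = train  # attach the train method defined above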
It works, but:

- the accuracy at the end (for my use case: MNIST digit recognition) is just OK, but not very good. It is much better (i.e. the convergence is much better) when line (1) is replaced by:
delta = np.dot(self.weights[k].T, delta) # (2)
- the code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests:
delta = np.dot(self.weights[k].T, delta)
instead of:
delta = np.dot(self.weights[k].T, tmp)
(With the notations of this article, it is:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
)
These two arguments seem to agree: code (2) is better than code (1).
However, the math seems to show the contrary (see the video here; one more detail: note that my loss function is multiplied by 1/2, whereas it is not in the video):
Question: which one is correct, implementation (1) or (2)?

In LaTeX:
$$C = \frac{1}{2} (a^L - y)^2$$
$$a^L = \sigma(\underbrace{w^L a^{L-1} + b^L}_{z^L}) = \sigma(z^L)$$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y)$$
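In matrix form (the formulas above are written as if for scalars), these chain into the recursion that the code variables implement, with delta storing ∂C/∂a and tmp storing ∂C/∂z:

$$\frac{\partial C}{\partial z^{l}} = \frac{\partial C}{\partial a^{l}} \odot \sigma'(z^{l}), \qquad \frac{\partial C}{\partial a^{l-1}} = (w^{l})^T \, \frac{\partial C}{\partial z^{l}}, \qquad \frac{\partial C}{\partial w^{l}} = \frac{\partial C}{\partial z^{l}} \, \left(a^{l-1}\right)^T$$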
Accepted answer
I spent two days analyzing this problem and filled a few pages of a notebook with partial derivative computations... and I can confirm:

- the math written in LaTeX in the question is correct
- code (1) is the correct one, and it agrees with the math computations:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
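One way to verify this independently (a sketch, not part of the original answer; the small network and random data below are assumptions) is a finite-difference gradient check: perturb a single weight, recompute the loss C = 1/2 (a - y)^2, and compare with the analytic gradient np.dot(tmp, A[k].T):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(weights, x, y):
    # forward pass matching the snippet: 3 layers, no biases, C = 1/2 * sum((a - y)^2)
    a = x
    for W in weights:
        a = sigmoid(np.dot(W, a))
    return 0.5 * np.sum((a - y) ** 2)

def analytic_grads(weights, x, y):
    # back-propagation exactly as in code (1)
    a = x
    A = [a]
    for W in weights:
        a = sigmoid(np.dot(W, a))
        A.append(a)
    grads = [None, None, None]
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * A[k + 1] * (1.0 - A[k + 1])  # sigmoid_prime on stored activations
        delta = np.dot(weights[k].T, tmp)          # line (1)
        grads[k] = np.dot(tmp, A[k].T)
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]  # arbitrary small layer sizes, chosen just for this test
weights = [rng.normal(0.0, 1.0, (sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(0.0, 1.0, (4, 1))
y = rng.normal(0.0, 1.0, (3, 1))

grads = analytic_grads(weights, x, y)
k, i, j, eps = 1, 2, 3, 1e-6  # probe one arbitrary weight
W_plus = [W.copy() for W in weights]
W_plus[k][i, j] += eps
W_minus = [W.copy() for W in weights]
W_minus[k][i, j] -= eps
numeric = (loss(W_plus, x, y) - loss(W_minus, x, y)) / (2.0 * eps)
print(numeric, grads[k][i, j])  # the two values should agree closely

With line (2) substituted into analytic_grads, the numeric and analytic values diverge for the lower layers, which is exactly the claim being confirmed here.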
Code (2) is wrong:
delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
and in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set, the line:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
should be:
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
Now the difficult part that took me days to realize:

- Apparently, code (2) has a far better convergence than code (1); that's why I was misled into thinking code (2) was correct and code (1) was wrong.
... But in fact that is just a coincidence, because the learning_rate was set too low. Here is the reason: when using code (2), the parameter delta grows much faster (print(np.linalg.norm(delta)) helps to see this) than with code (1).
Thus the "incorrect code (2)" just compensated for the "too slow learning rate" by producing a bigger delta, and in some cases this led to an apparently faster convergence.
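A quick way to see this (a diagnostic sketch using the print mentioned above, not code from the original post) is to probe the norm of delta inside the back-propagation loop:

# back-propagation loop with a norm probe on delta
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k + 1])
    delta = np.dot(self.weights[k].T, tmp)      # variant (1): correct
    # delta = np.dot(self.weights[k].T, delta)  # variant (2): norm of delta grows much faster
    print(k, np.linalg.norm(delta))             # watch how ||delta|| evolves per layer
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)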
Now it's solved!