Forward vs reverse mode differentiation - Pytorch

Question

In the first example of Learning PyTorch with Examples, the author demonstrates how to create a neural network with numpy. Their code is pasted below for convenience:

# from: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

What is confusing to me is why the gradients of w1 and w2 are computed with respect to the loss (second-to-last code block).

Normally the opposite computation happens: the gradient of the loss is computed with respect to the weights, as quoted here:

  • "When training neural networks, we think of the cost (a value describing how bad a neural network performs) as a function of the parameters (numbers describing how the network behaves). We want to calculate the derivatives of the cost with respect to all the parameters, for use in gradient descent. Now, there’s often millions, or even tens of millions of parameters in a neural network. So, reverse-mode differentiation, called backpropagation in the context of neural networks, gives us a massive speed up!" (Colah's blog).

So my question is: why is the derivative computation in the example above in the reverse order compared to normal backpropagation computations?

Answer

Seems to be a typo in the comment. They are actually computing the gradient of loss w.r.t. w2 and w1.

Let's quickly derive the gradient of loss w.r.t. w2 just to be sure. By inspection of your code we have

$$y_{\text{pred}} = h_{\text{relu}}\, w_2, \qquad \text{loss} = \sum \left(y_{\text{pred}} - y\right)^2$$

Using the chain rule from calculus

$$\frac{\partial\, \text{loss}}{\partial w_2} = \frac{\partial\, \text{loss}}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial w_2}.$$

Each term can be represented using the basic rules of matrix calculus. These turn out to be

$$\frac{\partial\, \text{loss}}{\partial y_{\text{pred}}} = 2\,(y_{\text{pred}} - y), \qquad \frac{\partial y_{\text{pred}}}{\partial w_2} = h_{\text{relu}}$$

(in matrix form, the factor $h_{\text{relu}}$ enters transposed and on the left).

Plugging these terms back into the initial equation we get

$$\frac{\partial\, \text{loss}}{\partial w_2} = h_{\text{relu}}^{T} \cdot 2\,(y_{\text{pred}} - y).$$

Which perfectly matches the expressions described by

grad_y_pred = 2.0 * (y_pred - y)       # gradient of loss w.r.t. y_pred
grad_w2 = h_relu.T.dot(grad_y_pred)    # gradient of loss w.r.t. w2

in the back-propagation code you provided.
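
As a quick sanity check, here is a minimal sketch comparing the analytic grad_w2 against a central finite-difference estimate of the derivative of the loss with respect to a single entry of w2:

import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 4, 5, 3, 2   # tiny sizes keep the check cheap
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

def loss_fn(w1, w2):
    h_relu = np.maximum(x.dot(w1), 0)
    y_pred = h_relu.dot(w2)
    return np.square(y_pred - y).sum()

# Analytic gradient, exactly as in the tutorial code
h_relu = np.maximum(x.dot(w1), 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)

# Central finite-difference estimate of d(loss)/d(w2[i, j])
i, j, eps = 1, 0, 1e-6
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[i, j] += eps
w2_minus[i, j] -= eps
numeric = (loss_fn(w1, w2_plus) - loss_fn(w1, w2_minus)) / (2 * eps)

print(grad_w2[i, j], numeric)   # the two values should agree closely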
