Why do we need to call zero_grad() in PyTorch?


Question

The method zero_grad() needs to be called during training. But the documentation is not very helpful:

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

Why do we need to call this method?

Answer

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
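
For illustration, here is a minimal sketch (not part of the original answer) showing the accumulation: calling .backward() twice without zeroing adds the second gradient on top of the first.

import torch

x = torch.ones(2, requires_grad=True)

# first backward pass: the gradient of x.sum() w.r.t. x is a vector of ones
x.sum().backward()
print(x.grad)            # tensor([1., 1.])

# second backward pass without zeroing: the new gradient is added, not assigned
(2 * x.sum()).backward()
print(x.grad)            # tensor([3., 3.])  -> 1 + 2, accumulated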

Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly computed gradient. It would therefore point in some direction other than the intended direction towards the minimum (or maximum, in the case of maximization objectives).

Here is a simple example:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # registered with this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()  # reduce to a scalar so backward() can be called
    loss.backward()
    optimizer.step()
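
If you want to exploit the accumulating behaviour on purpose (e.g. to simulate a larger batch), you can call optimizer.step() only every few mini-batches. The following is a hypothetical sketch that reuses linear_model, W, b, optimizer, data and targets from the example above; accum_steps is an assumed hyperparameter, not something from the original answer.

accum_steps = 4  # assumed number of mini-batches to accumulate over

optimizer.zero_grad()
for i, (sample, target) in enumerate(zip(data, targets)):
    output = linear_model(sample, W, b)
    # scale the loss so the accumulated sum matches one big batch
    loss = ((output - target) ** 2).sum() / accum_steps
    loss.backward()                  # gradients keep accumulating in W.grad and b.grad

    if (i + 1) % accum_steps == 0:
        optimizer.step()             # update using the accumulated gradients
        optimizer.zero_grad()        # reset before the next accumulation window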


Alternatively, if you are doing vanilla gradient descent, then:

learning_rate = 0.01  # not defined in the original snippet; any small step size works

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (.grad is still None before the first backward pass, hence the guard)
    if W.grad is not None:
        W.grad.zero_()
    if b.grad is not None:
        b.grad.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update the parameters without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad


Note:

  • The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
  • As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully (see the sketch below).
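
As a rough illustration of the set_to_none behaviour (a minimal sketch, not from the original answer):

import torch

w = torch.randn(3, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

w.sum().backward()
print(w.grad)                        # tensor([1., 1., 1.])

opt.zero_grad(set_to_none=True)
print(w.grad)                        # None -- the gradient tensor is released

w.sum().backward()
opt.zero_grad(set_to_none=False)
print(w.grad)                        # tensor([0., 0., 0.]) -- filled with zeroes instead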
