Understanding accumulated gradients in PyTorch

Question

I am trying to comprehend inner workings of the gradient accumulation in PyTorch. My question is somewhat related to these two:

Why do we need to call zero_grad() in PyTorch?

Why do we need to explicitly call zero_grad()?

Comments to the accepted answer to the second question suggest that accumulated gradients can be used if a minibatch is too large to perform a gradient update in a single forward pass, and thus has to be split into multiple sub-batches.

Consider the following toy example:

import numpy as np
import torch


class ExampleLinear(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Initialize the weight at 1
        self.weight = torch.nn.Parameter(torch.Tensor([1]).float(),
                                         requires_grad=True)

    def forward(self, x):
        return self.weight * x


if __name__ == "__main__":
    # Example 1
    model = ExampleLinear()

    # Generate some data
    x = torch.from_numpy(np.array([4, 2])).float()
    y = 2 * x

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    y_hat = model(x)          # forward pass

    loss = (y - y_hat) ** 2
    loss = loss.mean()        # MSE loss

    loss.backward()           # backward pass

    optimizer.step()          # weight update

    print(model.weight.grad)  # tensor([-20.])
    print(model.weight)       # tensor([1.2000], requires_grad=True)

Which is exactly the result one would expect. Now assume that we want to process the dataset sample-by-sample utilizing gradient accumulation:

    # Example 2: MSE sample-by-sample
    model2 = ExampleLinear()
    optimizer = torch.optim.SGD(model2.parameters(), lr=0.01)

    # Compute loss sample-by-sample, then average it over all samples
    loss = []
    for k in range(len(y)):
        y_hat = model2(x[k])
        loss.append((y[k] - y_hat) ** 2)
    loss = sum(loss) / len(y)

    loss.backward()     # backward pass
    optimizer.step()    # weight update

    print(model2.weight.grad)  # tensor([-20.])
    print(model2.weight)       # tensor([1.2000], requires_grad=True)

Again as expected, the gradient is calculated when the .backward() method is called.

Finally to my question: what exactly happens 'under the hood'?

My understanding is that the computational graph is dynamically updated going from <PowBackward> to <AddBackward> <DivBackward> operations for the loss variable, and that no information about the data used for each forward pass is retained anywhere except for the loss tensor which can be updated until the backward pass.

Are there any caveats to the reasoning in the above paragraph? Lastly, are there any best practices to follow when using gradient accumulation (i.e. can the approach I use in Example 2 backfire somehow)?

Answer

You are not actually accumulating gradients. Just leaving off optimizer.zero_grad() has no effect if you have a single .backward() call, as the gradients are already zero to begin with (technically None but they will be automatically initialised to zero).
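To see what accumulation actually looks like, here is a minimal sketch (a standalone parameter `w` of my own, not your `ExampleLinear`) showing that `.grad` starts out as None and that a second `.backward()` call without zeroing adds to the stored gradient:

```python
import torch

# A single trainable weight, initialised at 1 as in the question
w = torch.nn.Parameter(torch.tensor([1.0]))
grad_before = w.grad  # None: no backward pass has run yet

# First backward pass on the sample x = 4, y = 8
loss = ((8.0 - w * 4.0) ** 2).sum()
loss.backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to .grad
loss = ((8.0 - w * 4.0) ** 2).sum()
loss.backward()
second = w.grad.clone()

print(grad_before, first, second)
```

With a single `.backward()` call there is nothing to add to, which is why leaving out `zero_grad()` made no difference in your examples.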

The only difference between your two versions is how you calculate the final loss. The for loop in the second example performs the same calculations as PyTorch does in the first example, but one sample at a time, so PyTorch cannot optimise (parallelise and vectorise) it. That makes an especially large difference on GPUs, provided the tensors aren't tiny.

Before getting to gradient accumulation, let's start with your question:

Finally to my question: what exactly happens 'under the hood'?

Every operation on tensors is tracked in a computational graph if and only if one of the operands is already part of a computational graph. When you set requires_grad=True on a tensor, it creates a computational graph with a single vertex, the tensor itself, which will remain a leaf in the graph. Any operation with that tensor creates a new vertex, the result of the operation, with an edge from each operand to it that tracks the operation that was performed.

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(4.0)
c = a + b # => tensor(6., grad_fn=<AddBackward0>)

a.requires_grad # => True
a.is_leaf # => True

b.requires_grad # => False
b.is_leaf # => True

c.requires_grad # => True
c.is_leaf # => False

Every intermediate tensor automatically requires gradients and has a grad_fn, which is the function to calculate the partial derivatives with respect to its inputs. Thanks to the chain rule, we can traverse the whole graph in reverse order to calculate the derivatives with respect to every single leaf, which are the parameters we want to optimise. That's the idea of backpropagation, also known as reverse mode differentiation. For more details I recommend reading Calculus on Computational Graphs: Backpropagation.
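If you want to look at those vertices and edges yourself, a small sketch along these lines should work. It reuses `a`, `b` and `c` from the snippet above and reads `grad_fn.next_functions`, which lists the backward nodes feeding into a vertex:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(4.0)
c = a + b

# The backward node attached to the result of the addition
node_name = type(c.grad_fn).__name__

# One entry per operand: a leaf that requires gradients shows up as an
# AccumulateGrad node, an operand that is not tracked shows up as None.
edge_names = [
    None if fn is None else type(fn).__name__
    for fn, _ in c.grad_fn.next_functions
]

print(node_name)   # AddBackward0
print(edge_names)  # ['AccumulateGrad', None]
```

The AccumulateGrad node is the piece that writes into `a.grad` when the backward pass reaches the leaf.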

PyTorch uses that exact idea. When you call loss.backward(), it traverses the graph in reverse order, starting from loss, and calculates the derivatives for each vertex. Whenever a leaf is reached, the calculated derivative for that tensor is stored in its .grad attribute.

In your first example, that would lead to:

MeanBackward -> PowBackward -> SubBackward -> MulBackward
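You can walk that chain by hand for your toy data. The following is plain Python with no autograd involved, just the chain rule applied in the same order (the variable names are mine):

```python
# Toy data and parameter from Example 1
x = [4.0, 2.0]
y = [2.0 * xi for xi in x]
w = 1.0
lr = 0.01
n = len(x)

# loss = mean((y - w * x) ** 2), differentiated from the loss back to w:
#   d mean / d (each squared error) = 1 / n    (MeanBackward)
#   d (r ** 2) / d r                = 2 * r    (PowBackward)
#   d (y - y_hat) / d y_hat         = -1       (SubBackward)
#   d (w * x) / d w                 = x        (MulBackward)
grad = sum(
    (1.0 / n) * 2.0 * (yi - w * xi) * (-1.0) * xi
    for xi, yi in zip(x, y)
)

# Plain SGD update, as optimizer.step() does with lr=0.01
new_w = w - lr * grad

print(grad, new_w)  # grad is -20.0, new_w is approximately 1.2
```

These are the same values your script prints for model.weight.grad and model.weight.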

The second example is almost identical, except that you calculate the mean manually, and instead of having a single path for the loss, you have multiple paths for each element of the loss calculation. To clarify, the single path also calculates the derivatives of each element, but internally, which again opens up the possibilities for some optimisations.

# Example 1
loss = (y - y_hat) ** 2
# => tensor([16.,  4.], grad_fn=<PowBackward0>)

# Example 2
loss = []
for k in range(len(y)):
    y_hat = model2(x[k])
    loss.append((y[k] - y_hat) ** 2)
loss
# => [tensor([16.], grad_fn=<PowBackward0>), tensor([4.], grad_fn=<PowBackward0>)]

In either case a single graph is created that is backpropagated exactly once, which is why it's not considered gradient accumulation.
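A quick way to convince yourself of that is to build both losses side by side. This is a sketch with bare parameters `w1` and `w2` standing in for your two models:

```python
import torch

x = torch.tensor([4.0, 2.0])
y = 2 * x

# Example 1: vectorised loss, mean taken by PyTorch
w1 = torch.nn.Parameter(torch.tensor([1.0]))
loss1 = ((y - w1 * x) ** 2).mean()

# Example 2: per-sample losses, mean taken manually
w2 = torch.nn.Parameter(torch.tensor([1.0]))
per_sample = [(y[k] - w2 * x[k]) ** 2 for k in range(len(y))]
loss2 = sum(per_sample) / len(y)

# Different last operation, but each loss is the root of ONE graph
root1 = type(loss1.grad_fn).__name__  # MeanBackward0
root2 = type(loss2.grad_fn).__name__  # DivBackward0

# One backward pass per graph, same gradient
loss1.backward()
loss2.backward()
print(root1, root2, w1.grad, w2.grad)
```

Both gradients come out as tensor([-20.]), each from a single backward pass.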

Gradient Accumulation

Gradient accumulation refers to the situation where multiple backward passes are performed before updating the parameters. The goal is to have the same model parameters for multiple inputs (batches) and then update the model's parameters based on all these batches, instead of performing an update after every single batch.

Let's revisit your example. x has size [2], which is the size of our entire dataset. For some reason, we need to calculate the gradients based on the whole dataset. That is naturally the case when using a batch size of 2, since we would have the whole dataset at once. But what happens if we can only have batches of size 1? We could run them individually and update the model after each batch as usual, but then we wouldn't calculate the gradients over the whole dataset.

What we need to do is run each sample individually with the same model parameters and calculate the gradients without updating the model. Now you might be thinking, isn't that what you did in the second version? Almost, but not quite. There is a crucial problem in your version: you are using the same amount of memory as in the first version, because you have the same calculations and therefore the same number of values in the computational graph.

How do we free memory? We need to get rid of the tensors of the previous batch and also the computational graph, because that uses a lot of memory to keep track of everything that's necessary for the backpropagation. The computational graph is automatically destroyed when .backward() is called (unless retain_graph=True is specified).
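You can observe the graph being released with a small experiment. This is only a sketch on a standalone weight: a second `.backward()` on the same loss fails because the saved intermediate values are gone, unless the first call passed retain_graph=True.

```python
import torch

# Default behaviour: the graph's saved tensors are freed by backward()
w = torch.tensor([1.0], requires_grad=True)
loss = ((w * 4.0 - 8.0) ** 2).mean()
loss.backward()
try:
    loss.backward()
    freed_after_first_backward = False
except RuntimeError:
    # PyTorch refuses to backpropagate through a freed graph
    freed_after_first_backward = True

# With retain_graph=True the graph survives, at the cost of keeping its memory
w_kept = torch.tensor([1.0], requires_grad=True)
loss_kept = ((w_kept * 4.0 - 8.0) ** 2).mean()
loss_kept.backward(retain_graph=True)
loss_kept.backward()  # works, and the gradient is accumulated a second time

print(freed_after_first_backward, w_kept.grad)
```

For gradient accumulation you want the default: run backward per batch so each batch's graph can be dropped before the next one is built.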

def calculate_loss(x: torch.Tensor) -> torch.Tensor:
    y = 2 * x
    y_hat = model(x)
    loss = (y - y_hat) ** 2
    return loss.mean()


# With multiple batches of size 1
# (model and optimizer are a fresh ExampleLinear and SGD with lr=0.01, as in Example 1)
batches = [torch.tensor([4.0]), torch.tensor([2.0])]

optimizer.zero_grad()
for i, batch in enumerate(batches):
    # The loss needs to be scaled, because the mean should be taken across the whole
    # dataset, which requires the loss to be divided by the number of batches.
    loss = calculate_loss(batch) / len(batches)
    loss.backward()
    print(f"Batch size 1 (batch {i}) - grad: {model.weight.grad}")
    print(f"Batch size 1 (batch {i}) - weight: {model.weight}")

# Updating the model only after all batches
optimizer.step()
print(f"Batch size 1 (final) - grad: {model.weight.grad}")
print(f"Batch size 1 (final) - weight: {model.weight}")

Output (I removed the Parameter containing messages for readability):

Batch size 1 (batch 0) - grad: tensor([-16.])
Batch size 1 (batch 0) - weight: tensor([1.], requires_grad=True)
Batch size 1 (batch 1) - grad: tensor([-20.])
Batch size 1 (batch 1) - weight: tensor([1.], requires_grad=True)
Batch size 1 (final) - grad: tensor([-20.])
Batch size 1 (final) - weight: tensor([1.2000], requires_grad=True)

As you can see, the model kept the same parameter for all batches while the gradients were accumulated, and there is a single update at the end. Note that the loss needs to be scaled per batch, in order to have the same significance over the whole dataset as if you used a single batch.
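To make the effect of the scaling concrete, here is a sketch that compares the full-batch gradient with the accumulated gradient, with and without the division. The helper `make_weight` and the bare parameters are mine, standing in for your model:

```python
import torch


def make_weight() -> torch.nn.Parameter:
    # Fresh weight initialised at 1, as in ExampleLinear
    return torch.nn.Parameter(torch.tensor([1.0]))


x = torch.tensor([4.0, 2.0])
y = 2 * x

# Reference: one batch of size 2, one backward pass
w_full = make_weight()
((y - w_full * x) ** 2).mean().backward()

# Accumulation with each loss divided by the number of batches
w_scaled = make_weight()
for k in range(len(x)):
    loss = ((y[k] - w_scaled * x[k]) ** 2).mean() / len(x)
    loss.backward()

# Accumulation without scaling: the per-batch gradients are simply summed
w_unscaled = make_weight()
for k in range(len(x)):
    ((y[k] - w_unscaled * x[k]) ** 2).mean().backward()

print(w_full.grad, w_scaled.grad, w_unscaled.grad)
```

The scaled version matches the full-batch gradient of -20, while the unscaled one ends up at -40, which would double the effective step size here.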

In this example the whole dataset is used before performing the update, but you can easily change that to update the parameters after a certain number of batches. You just have to remember to zero out the gradients after each optimiser step. The general recipe would be:

accumulation_steps = 10
for i, batch in enumerate(batches):
    # Scale the loss to the mean of the accumulated batch size
    loss = calculate_loss(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        # Reset gradients, for the next accumulated batches
        optimizer.zero_grad()
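One detail the recipe leaves open is what happens when the number of batches is not a multiple of accumulation_steps: the last few gradients would never be applied. A possible way to handle that, shown here as a self-contained sketch, is to also step on the final batch. The `is_last` check, the `torch.nn.Linear` stand-in model and the made-up batches are my assumptions, not part of the recipe above.

```python
import torch

# Stand-in model and data (hypothetical, for illustration only)
model = torch.nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [torch.tensor([[v]]) for v in (4.0, 2.0, 3.0, 1.0, 5.0)]

accumulation_steps = 2
num_updates = 0

optimizer.zero_grad()
for i, batch in enumerate(batches):
    target = 2 * batch
    # Scale the loss to the mean of the accumulated batch size
    loss = ((target - model(batch)) ** 2).mean() / accumulation_steps
    loss.backward()

    is_last = (i + 1) == len(batches)
    if (i + 1) % accumulation_steps == 0 or is_last:
        optimizer.step()
        # Reset gradients for the next group of accumulated batches
        optimizer.zero_grad()
        num_updates += 1

print(num_updates)  # 3 updates: after batches 2 and 4, plus the leftover batch
```

Note that the leftover group is still divided by accumulation_steps, so its update is smaller than a full group's. Whether to rescale it or simply drop it is a judgment call for your training setup.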

You can find that recipe and more techniques for working with large batch sizes in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
