Understanding accumulated gradients in PyTorch


Problem Description

I am trying to understand the inner workings of gradient accumulation in PyTorch. My question is related to these two:


Why do we need to call zero_grad() in PyTorch?

Why do we need to explicitly call zero_grad()?

Comments on the accepted answer to the second question suggest that accumulated gradients can be used if a minibatch is too large to perform a gradient update in a single forward pass, and thus has to be split into multiple sub-batches.

Consider the following toy example:

import numpy as np
import torch


class ExampleLinear(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Initialize the weight at 1
        self.weight = torch.nn.Parameter(torch.Tensor([1]).float(),
                                         requires_grad=True)

    def forward(self, x):
        return self.weight * x


if __name__ == "__main__":
    # Example 1
    model = ExampleLinear()

    # Generate some data
    x = torch.from_numpy(np.array([4, 2])).float()
    y = 2 * x

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    y_hat = model(x)          # forward pass

    loss = (y - y_hat) ** 2
    loss = loss.mean()        # MSE loss

    loss.backward()           # backward pass

    optimizer.step()          # weight update

    print(model.weight.grad)  # tensor([-20.])
    print(model.weight)       # tensor([1.2000], requires_grad=True)

Which is exactly the result one would expect. Now assume that we want to process the dataset sample-by-sample utilizing gradient accumulation:

    # Example 2: MSE sample-by-sample
    model2 = ExampleLinear()
    optimizer = torch.optim.SGD(model2.parameters(), lr=0.01)

    # Compute loss sample-by-sample, then average it over all samples
    loss = []
    for k in range(len(y)):
        y_hat = model2(x[k])
        loss.append((y[k] - y_hat) ** 2)
    loss = sum(loss) / len(y)

    loss.backward()     # backward pass
    optimizer.step()    # weight update

    print(model2.weight.grad)  # tensor([-20.])
    print(model2.weight)       # tensor([1.2000], requires_grad=True)

Again as expected, the gradient is calculated when the .backward() method is called.

Finally to my question: what exactly happens 'under the hood'?

My understanding is that the computational graph is dynamically updated, going from <PowBackward> to <AddBackward> and <DivBackward> operations for the loss variable, and that no information about the data used for each forward pass is retained anywhere except for the loss tensor, which can be updated until the backward pass.

Are there any caveats to the reasoning in the above paragraph? Lastly, are there any best practices to follow when using gradient accumulation (i.e. can the approach I use in Example 2 backfire somehow)?

Solution

You are not actually accumulating gradients. Just leaving off optimizer.zero_grad() has no effect if you have a single .backward() call, as the gradients are already zero to begin with (technically None but they will be automatically initialised to zero).
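
As a quick illustration (a minimal sketch of my own, not part of the original question): .grad starts out as None, and every .backward() call adds into it.

w = torch.nn.Parameter(torch.tensor([1.0]))
print(w.grad)  # None - no gradient has been allocated yet

loss = (w * 2) ** 2
loss.backward()
print(w.grad)  # tensor([8.]) - created and filled by the first backward pass

loss = (w * 2) ** 2
loss.backward()
print(w.grad)  # tensor([16.]) - the second backward added onto the existing value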

The only difference between your two versions is how you calculate the final loss. The for loop of the second example does the same calculations as PyTorch does in the first example, but you do them individually, and PyTorch cannot optimise (parallelise and vectorise) your for loop, which makes an especially staggering difference on GPUs, provided that the tensors aren't tiny.
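
To get a feeling for that difference, here is a rough CPU-only timing sketch (my own addition; exact numbers depend on hardware and tensor size). It mirrors your loss computation with y = 2 * x and y_hat = w * x:

import time
import torch

x = torch.randn(10_000)
w = torch.nn.Parameter(torch.tensor(1.0))

start = time.perf_counter()
loss = ((2 * x - w * x) ** 2).mean()  # one vectorised graph over all elements
print(f"vectorised:  {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
loss = sum((2 * x[k] - w * x[k]) ** 2 for k in range(len(x))) / len(x)  # one graph node per element
print(f"python loop: {time.perf_counter() - start:.4f}s")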

Before getting to gradient accumulation, let's start with your question:

Finally to my question: what exactly happens 'under the hood'?

Every operation on tensors is tracked in a computational graph if and only if one of the operands is already part of a computational graph. When you set requires_grad=True on a tensor, it creates a computational graph with a single vertex, the tensor itself, which will remain a leaf in the graph. Any operation with that tensor will create a new vertex, which is the result of the operation, hence there is an edge from the operands to it, tracking the operation that was performed.

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(4.0)
c = a + b # => tensor(6., grad_fn=<AddBackward0>)

a.requires_grad # => True
a.is_leaf # => True

b.requires_grad # => False
b.is_leaf # => True

c.requires_grad # => True
c.is_leaf # => False

Every intermediate tensor automatically requires gradients and has a grad_fn, which is the function to calculate the partial derivatives with respect to its inputs. Thanks to the chain rule, we can traverse the whole graph in reverse order to calculate the derivatives with respect to every single leaf, which are the parameters we want to optimise. That's the idea of backpropagation, also known as reverse mode differentiation. For more details I recommend reading Calculus on Computational Graphs: Backpropagation.

PyTorch uses that exact idea, when you call loss.backward() it traverses the graph in reverse order, starting from loss, and calculates the derivatives for each vertex. Whenever a leaf is reached, the calculated derivative for that tensor is stored in its .grad attribute.
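
Continuing the a/b/c snippet above (my own addition), you can inspect that structure directly: every grad_fn links back to its inputs' grad_fns through next_functions, and calling .backward() fills in a.grad:

d = c ** 2                       # => tensor(36., grad_fn=<PowBackward0>)
print(d.grad_fn)                 # <PowBackward0 object at 0x...>
print(d.grad_fn.next_functions)  # ((<AddBackward0 object at 0x...>, 0),)

d.backward()
print(a.grad)  # tensor(12.) - dd/da = 2 * c = 12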

In your first example, that would lead to:

MeanBackward -> PowBackward -> SubBackward -> MulBackward

The second example is almost identical, except that you calculate the mean manually, and instead of having a single path for the loss, you have multiple paths for each element of the loss calculation. To clarify, the single path also calculates the derivatives of each element, but internally, which again opens up the possibilities for some optimisations.

# Example 1
loss = (y - y_hat) ** 2
# => tensor([16.,  4.], grad_fn=<PowBackward0>)

# Example 2
loss = []
for k in range(len(y)):
    y_hat = model2(x[k])
    loss.append((y[k] - y_hat) ** 2)
loss
# => [tensor([16.], grad_fn=<PowBackward0>), tensor([4.], grad_fn=<PowBackward0>)]

In either case a single graph is created that is backpropagated exactly once, that's the reason it's not considered gradient accumulation.
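
You can verify that single-use property yourself (a small sketch of my own): a second backward pass through the same graph raises a RuntimeError unless retain_graph=True was passed:

w = torch.nn.Parameter(torch.tensor([1.0]))
loss = ((w * 3.0 - 6.0) ** 2).mean()
loss.backward(retain_graph=True)  # keep the graph alive for another pass
loss.backward()                   # fine, adds into w.grad a second time
print(w.grad)  # tensor([-36.]) - two passes of -18. each
# loss.backward()  # would now raise: RuntimeError: Trying to backward through the graph a second time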

Gradient Accumulation

Gradient accumulation refers to the situation where multiple backward passes are performed before updating the parameters. The goal is to have the same model parameters for multiple inputs (batches) and then update the model's parameters based on all these batches, instead of performing an update after every single batch.

Let's revisit your example. x has size [2], that's the size of our entire dataset. For some reason, we need to calculate the gradients based on the whole dataset. That is naturally the case when using a batch size of 2, since we would have the whole dataset at once. But what happens if we can only have batches of size 1? We could run them individually and update the model after each batch as usual, but then we don't calculate the gradients over the whole dataset.

What we need to do, is run each sample individually with the same model parameters and calculate the gradients without updating the model. Now you might be thinking, isn't that what you did in the second version? Almost, but not quite, and there is a crucial problem in your version, namely that you are using the same amount of memory as in the first version, because you have the same calculations and therefore the same number of values in the computational graph.

How do we free memory? We need to get rid of the tensors of the previous batch and also the computational graph, because that uses a lot of memory to keep track of everything that's necessary for the backpropagation. The computational graph is automatically destroyed when .backward() is called (unless retain_graph=True is specified).

def calculate_loss(x: torch.Tensor) -> torch.Tensor:
    y = 2 * x
    y_hat = model(x)
    loss = (y - y_hat) ** 2
    return loss.mean()


# With multiple batches of size 1
batches = [torch.tensor([4.0]), torch.tensor([2.0])]

optimizer.zero_grad()
for i, batch in enumerate(batches):
    # The loss needs to be scaled, because the mean should be taken across the whole
    # dataset, which requires the loss to be divided by the number of batches.
    loss = calculate_loss(batch) / len(batches)
    loss.backward()
    print(f"Batch size 1 (batch {i}) - grad: {model.weight.grad}")
    print(f"Batch size 1 (batch {i}) - weight: {model.weight}")

# Updating the model only after all batches
optimizer.step()
print(f"Batch size 1 (final) - grad: {model.weight.grad}")
print(f"Batch size 1 (final) - weight: {model.weight}")

Output (I removed the Parameter containing messages for readability):

Batch size 1 (batch 0) - grad: tensor([-16.])
Batch size 1 (batch 0) - weight: tensor([1.], requires_grad=True)
Batch size 1 (batch 1) - grad: tensor([-20.])
Batch size 1 (batch 1) - weight: tensor([1.], requires_grad=True)
Batch size 1 (final) - grad: tensor([-20.])
Batch size 1 (final) - weight: tensor([1.2000], requires_grad=True)

As you can see, the model kept the same parameter for all batches while the gradients were accumulated, and there is a single update at the end. Note that the loss needs to be scaled per batch, in order to have the same significance over the whole dataset as if you used a single batch.
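
A quick sanity check of that scaling with the numbers from this example (my own addition): the unscaled per-sample losses are 16. and 4., and dividing each by the number of batches reproduces the full-batch mean, which is why the accumulated gradient ends up identical (-20.):

full_batch_loss = torch.tensor([16.0, 4.0]).mean()  # tensor(10.)
accumulated_loss = 16.0 / 2 + 4.0 / 2               # 10.0
# Identical total loss => identical gradient once both backward passes have run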

While in this example, the whole dataset is used before performing the update, you can easily change that to update the parameters after a certain number of batches, but you have to remember to zero out the gradients after an optimiser step was taken. The general recipe would be:

accumulation_steps = 10
for i, batch in enumerate(batches):
    # Scale the loss to the mean of the accumulated batch size
    loss = calculate_loss(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        # Reset gradients, for the next accumulated batches
        optimizer.zero_grad()
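
One caveat with that recipe (my own addition, not from the original answer): if the number of batches is not divisible by accumulation_steps, the final partial accumulation never triggers a step. A common fix is to also step on the last batch:

accumulation_steps = 10
for i, batch in enumerate(batches):
    loss = calculate_loss(batch) / accumulation_steps
    loss.backward()
    # Step on every full accumulation window, and also on the final batch
    if (i + 1) % accumulation_steps == 0 or (i + 1) == len(batches):
        optimizer.step()
        optimizer.zero_grad()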

You can find that recipe and more techniques for working with large batch sizes in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
