Gradient accumulation in an RNN

Problem Description

I ran into some memory issues (GPU) when running a large RNN network, but I want to keep my batch size reasonable, so I wanted to try out gradient accumulation. In a network where you predict the output in one go, that seems self-evident, but in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended. I started from user albanD's excellent examples here, but I think they need to be modified for an RNN, because you accumulate many more gradients when you do multiple forward passes per sequence.

My current implementation looks like this, while also allowing for AMP in PyTorch 1.6, which seems important since everything needs to be called in the right place. Note that this is just an abstract version; it might look like a lot of code, but it is mostly comments.

def train(epochs):
    """Main training loop. Loops for `epoch` number of epochs. Calls `process`."""
    for epoch in range(1, epochs + 1):
        train_loss = process("train")
        valid_loss = process("valid")
        # ... check whether we improved over earlier epochs
        if lr_scheduler:
            lr_scheduler.step(valid_loss)
        
def process(do):
    """Do a single epoch run through the dataloader of the training or validation set. 
       Also takes care of optimizing the model after every `gradient_accumulation_steps` steps.
       Calls `step` for each batch where it gets the loss from."""
    if do == "train":
        model.train()
        torch.set_grad_enabled(True)
    else:
        model.eval()
        torch.set_grad_enabled(False)
    
    loss = 0.
    for batch_idx, batch in enumerate(dataloaders[do]):
        step_loss, avg_step_loss = step(batch)
        loss += avg_step_loss

        if do == "train":
            if amp:
                scaler.scale(step_loss).backward()

                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    # Unscales the gradients of optimizer's assigned params in-place
                    scaler.unscale_(optimizer)
                    # clip in-place
                    clip_grad_norm_(model.parameters(), 2.0)
                    scaler.step(optimizer)
                    scaler.update()
                    model.zero_grad()
            else:
                step_loss.backward()
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    clip_grad_norm_(model.parameters(), 2.0)
                    optimizer.step()
                    model.zero_grad()
        
    # return average loss
    return loss / len(dataloaders[do])

def step(batch):
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # do stuff... init hidden state and first input etc.
    loss = torch.tensor([0.]).to(device)

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # overwrite previous decoder_hidden
            output, decoder_hidden = model(decoder_input, decoder_hidden)

            # compute loss between predicted classes (bs x classes) and correct classes for _this word_
            item_loss = criterion(output, target_tensor[i])

            # We calculate the gradients for the average step so that when
            # we do take an optimizer.step, it takes into account the mean step_loss
            # across batches. So basically (A+B+C)/3 = A/3 + B/3 + C/3
            loss += (item_loss / gradient_accumulation_steps)

        topv, topi = output.topk(1)
        decoder_input = topi.detach()

    return loss, loss.item() / target_len

The above does not seem to work as I had hoped, i.e. it still runs into out-of-memory issues very quickly. I think the reason is that step already accumulates so much information, but I am not sure.

Recommended Answer

For simplicity, I will only cover gradient accumulation with amp enabled; without amp the idea is the same. The step you presented already runs under amp, so let's stick with that.

The PyTorch documentation on amp includes an example of gradient accumulation. You should do it inside step. Each time you run loss.backward(), gradients are accumulated in the leaf tensors that the optimizer updates. Hence, your step should look like this (see comments):

def step():
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    # You should not accumulate the loss on the GPU; CPU memory is better for that.
    # Use the GPU only for calculations, not for gathering metrics etc.
    loss = 0

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            # where does decoder_input come from?
            # I assume it is defined in the real code
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            # Here you divide by accumulation steps
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        scaler.scale(item_loss).backward()
        loss += item_loss.detach().item()

        # Not sure what topv was for here
        _, topi = output.topk(1)
        decoder_input = topi.detach()

    # No need to return a loss that carries graph history, since backward was already called above
    return loss / target_len
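
As a quick aside, the accumulation behaviour this relies on is easy to verify in isolation. Below is a minimal sketch (not from the original answer; the toy tensor is made up for illustration) showing that repeated backward() calls sum their gradients into a leaf tensor's .grad until it is zeroed:

import torch

# A single leaf "parameter"; gradients accumulate in w.grad until it is reset.
w = torch.tensor(2.0, requires_grad=True)

# First micro-step: d(3*w)/dw = 3
(3 * w).backward()
print(w.grad)  # tensor(3.)

# Second micro-step: the new gradient is added, not overwritten: 3 + 5 = 8
(5 * w).backward()
print(w.grad)  # tensor(8.)

# This is why the optimizer step and the zero_grad call only happen every
# `gradient_accumulation_steps` batches in the training loop.
w.grad = None  # in a real loop: optimizer.zero_grad()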

As you detach decoder_input anyway (so it acts like a brand-new input without history, and the parameters will be optimized based on that, not on all previous runs), there is no need for a backward call in process. Also, you probably don't need decoder_hidden: if it isn't passed to the network, a torch.Tensor filled with zeros is passed implicitly.
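
For reference, that zero-initialization default is how PyTorch's built-in recurrent layers behave (a custom decoder may of course differ). A minimal check with torch.nn.GRU, using arbitrary layer sizes chosen only for illustration:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)  # (batch, seq_len, input_size)

# Passing no hidden state at all...
out_default, h_default = gru(x)

# ...gives the same result as passing an explicit all-zeros hidden state
# of shape (num_layers, batch, hidden_size).
out_zeros, h_zeros = gru(x, torch.zeros(1, 4, 16))

assert torch.allclose(out_default, out_zeros)
assert torch.allclose(h_default, h_zeros)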

Also, we should divide by gradient_accumulation_steps * target_len, as that is how many backward calls we run before a single optimization step.
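
To make the scaling concrete: with, say, gradient_accumulation_steps = 2 and target_len = 3, six backward calls run before one optimizer step, so dividing every item_loss by 2 * 3 = 6 makes the accumulated gradient equal to the gradient of the mean over all six losses. A small numeric sketch (toy parameter and losses, not from the original answer) that checks this equivalence:

import torch

G, T = 2, 3  # gradient_accumulation_steps, target_len
coeffs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # stand-ins for the G * T per-step losses

# Accumulate: one backward per step, each loss divided by G * T
w = torch.tensor(1.0, requires_grad=True)
for c in coeffs:
    ((c * w) / (G * T)).backward()
grad_accumulated = w.grad.clone()

# Reference: a single backward on the mean of all the losses
w.grad = None
torch.stack([c * w for c in coeffs]).mean().backward()

assert torch.allclose(grad_accumulated, w.grad)  # both are 3.5 here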

As some of your variables are ill-defined, I assume you just sketched an outline of what's going on.

Also, if you want the history to be kept, you shouldn't detach decoder_input; in that case it would look like this:

def step():
    """Processes one step (one batch) by forwarding multiple times to get a final prediction for a given sequence."""
    loss = 0

    for i in range(target_len):
        with torch.cuda.amp.autocast(enabled=amp):
            output, decoder_hidden = model(decoder_input, decoder_hidden)
            item_loss = criterion(output, target_tensor[i]) / (
                gradient_accumulation_steps * target_len
            )

        _, topi = output.topk(1)
        decoder_input = topi

        loss += item_loss
    scaler.scale(loss).backward()
    return loss.detach().cpu() / target_len

This effectively backpropagates through the RNN across the whole sequence and will probably raise an OOM error; I'm not sure what you are after here. If that's the case, there is not much you can do AFAIK, as the RNN computations are simply too long to fit on the GPU.

Only the relevant part of the process code is presented here, so it would be:

loss = 0.0
for batch_idx, batch in enumerate(dataloaders[do]):
    # Here everything is detached from graph so we're safe
    avg_step_loss = step(batch)
    loss += avg_step_loss

    if do == "train":
        if (batch_idx + 1) % gradient_accumulation_steps == 0:
            # You can use unscale as in the example in PyTorch's docs
            # just like you did
            scaler.unscale_(optimizer)
            # clip in-place
            clip_grad_norm_(model.parameters(), 2.0)
            scaler.step(optimizer)
            scaler.update()
            # IMO in this case optimizer.zero_grad is more readable
            # but it's a nitpicking
            optimizer.zero_grad()

# return average loss
return loss / len(dataloaders[do])
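
One practical detail this loop glosses over (my addition, not part of the original answer): if len(dataloaders[do]) is not a multiple of gradient_accumulation_steps, the last few batches leave gradients behind that never trigger an optimizer step. A hedged sketch of a final flush that would sit between the batch loop and the return, reusing the same names as above:

# After the batch loop: flush gradients left over from a trailing,
# incomplete accumulation window.
if do == "train" and len(dataloaders[do]) % gradient_accumulation_steps != 0:
    scaler.unscale_(optimizer)
    clip_grad_norm_(model.parameters(), 2.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()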

"[...] in an RNN you do multiple forward passes for each input step. Because of that, I fear that my implementation does not work as intended."

It does not matter. For each forward pass you should usually do one backward pass (which seems to be the case here; see step for possible options). After that we (usually) don't need the loss connected to the graph, as we have already performed backpropagation, obtained our gradients, and are ready to optimize the parameters.

"That loss needs to have history, as it goes back to the process loop where backward will be called on it."

No need to call backward in process as presented.
