PyTorch CNN model stops at loss.backward() without any prompt?


Question

My goal is five-class text classification.

I am fine-tuning BERT with a CNN-based model, but the program stops at loss.backward() without printing anything in cmd.

The same program runs successfully with RNN-based models such as LSTM and RCNN.

With some CNN-based models, however, this strange bug appears.

My CNN model code:

import torch
import torch.nn as nn
import torch.nn.functional as F
# from ..Models.Conv import Conv1d
from transformers import BertPreTrainedModel, BertModel
n_filters = 200
filter_sizes = [2,3,4]
class BertCNN(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_filters = n_filters
        self.filter_sizes = filter_sizes
        self.bert = BertModel(config)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, self.num_filters, (k, config.hidden_size))
                for k in self.filter_sizes])
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.fc_cnn = nn.Linear(self.num_filters *
                                len(self.filter_sizes), config.num_labels)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, input_ids,
                attention_mask=None, token_type_ids=None, head_mask=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            head_mask=head_mask)
        encoder_out = outputs[0]  # token-level hidden states: (batch, seq_len, hidden_size)
        out = encoder_out.unsqueeze(1)
        out = torch.cat([self.conv_and_pool(out, conv)
                         for conv in self.convs], 1)
        out = self.dropout(out)
        out = self.fc_cnn(out)
        return out
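
The tensor shapes in forward can be checked by hand. The sketch below is plain-Python bookkeeping only, with no tensors and no BERT. It assumes stride 1 and no padding (the nn.Conv2d defaults) and a hidden size of 768, which is an assumption since the post does not state the config. The helper name bert_cnn_shapes is hypothetical.

```python
# Shape bookkeeping for BertCNN.forward in plain Python (no tensors involved).
def bert_cnn_shapes(batch, seq_len, hidden_size, n_filters, filter_sizes, num_labels):
    shapes = {}
    # encoder_out.unsqueeze(1) adds a single input channel for Conv2d
    shapes["conv_input"] = (batch, 1, seq_len, hidden_size)
    pooled = []
    for k in filter_sizes:
        # Kernel (k, hidden_size), stride 1, no padding:
        # height shrinks to seq_len - k + 1 and width collapses to 1
        shapes[f"conv_k{k}"] = (batch, n_filters, seq_len - k + 1, 1)
        # squeeze(3), max_pool1d over the full length, squeeze(2)
        pooled.append((batch, n_filters))
    # torch.cat(..., 1) joins the pooled features along dimension 1
    shapes["concat"] = (batch, n_filters * len(pooled))
    # fc_cnn maps the concatenated features to the label scores
    shapes["logits"] = (batch, num_labels)
    return shapes


# hidden_size=768 is an assumed value; the other numbers come from the post
shapes = bert_cnn_shapes(batch=2, seq_len=80, hidden_size=768,
                         n_filters=200, filter_sizes=[2, 3, 4], num_labels=5)
for name, shape in shapes.items():
    print(name, shape)
```

With the batch of 2 and sequence length of 80 printed in the training loop, the final shape is (2, 5), which matches the logits size reported there.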

My training code:

        for step, batch in enumerate(data):
            self.model.train()
            batch = tuple(t.to(self.device) for t in batch)
            input_ids, input_mask, segment_ids, label_ids = batch
            print("input_ids, input_mask, segment_ids, label_ids SIZE: \n")   
            print(input_ids.size(), input_mask.size(),segment_ids.size(), label_ids.size()) 
            # torch.Size([2, 80]) torch.Size([2, 80]) torch.Size([2, 80]) torch.Size([2])
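            # forward() takes attention_mask before token_type_ids, so pass both by keyword
            logits = self.model(input_ids, attention_mask=input_mask, token_type_ids=segment_ids)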
            print("logits and label ids size: ",logits.size(), label_ids.size())
            # torch.Size([2, 5]) torch.Size([2])
            loss = self.criterion(output=logits, target=label_ids)
            if len(self.n_gpu) >= 2:
                loss = loss.mean()
            if self.gradient_accumulation_steps > 1:
                loss = loss / self.gradient_accumulation_steps
            if self.fp16:
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()
                clip_grad_norm_(amp.master_params(self.optimizer), self.grad_clip)
            else:
                loss.backward()  # debugging shows the program stops at this line without any error message
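
A side note on how the model is called: BertCNN.forward lists attention_mask before token_type_ids, so the order of positional arguments matters. The sketch below uses a stand-in function with the same parameter order (not the real model) and string placeholders instead of tensors to show where each argument lands. It illustrates argument binding only and says nothing about the cause of the hang.

```python
import inspect


# Stand-in with the same parameter order as BertCNN.forward (no model needed).
def forward(input_ids, attention_mask=None, token_type_ids=None, head_mask=None):
    pass


sig = inspect.signature(forward)

# Positional call in the order (input_ids, segment_ids, input_mask)
positional = sig.bind("input_ids", "segment_ids", "input_mask").arguments

# Keyword call that names each argument explicitly
keyword = sig.bind("input_ids",
                   attention_mask="input_mask",
                   token_type_ids="segment_ids").arguments

print(dict(positional))  # segment_ids is bound to attention_mask
print(dict(keyword))     # input_mask is bound to attention_mask
```

Passing the mask and segment tensors by keyword avoids depending on the parameter order.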

Changing the batch size to 1 does not help; the bug still occurs.

Step 1 logits:

logits tensor([[ 0.8831, -0.0368, -0.2206, -2.3484, -1.3595]], device='cuda:1', grad_fn=)

Step 1 loss:

tensor(1.5489, device='cuda:1', grad_fn=<NllLossBackward>)

So why does loss.backward() not complete?

Recommended answer

I ran my program on Linux and it completed successfully.

The problem is therefore very likely caused by the difference in operating system.

Previous OS: Windows 10
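
When a script stops without any message, it can help to record the environment and ask Python to report fatal errors and hangs. The following is a minimal sketch using only the standard library, not part of the original program. The helper name report_environment and the 60-second timeout are arbitrary choices for illustration.

```python
import faulthandler
import platform


def report_environment():
    """Collect OS and Python details worth comparing between machines."""
    return {
        "system": platform.system(),    # for example "Windows" or "Linux"
        "release": platform.release(),
        "python": platform.python_version(),
    }


# Print a Python traceback to stderr if the process dies from a fatal error
# (such as a segmentation fault) instead of exiting silently.
faulthandler.enable()

# If the script is still running after 60 seconds, dump the traceback of
# every thread once, without stopping the program.
faulthandler.dump_traceback_later(60, exit=False)

print(report_environment())

# ... the training loop, including loss.backward(), would run here ...

# Cancel the pending dump once the step has finished normally.
faulthandler.cancel_dump_traceback_later()
```

The traceback only shows which Python line was executing, so it narrows down the location rather than explaining a crash inside native code. Comparing the printed environment between the Windows 10 and Linux machines is one way to document the difference the answer points to.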
