RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training


Problem description

I saved a checkpoint while training on GPU. After reloading the checkpoint and continuing training, I get the following error:

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The saving happens after each epoch finishes:

torch.save({'epoch': epoch,
            'batch': batch_count,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': total_loss / len(train_loader),
            'train_set': args.train_set,
            'val_set': args.val_set,
            'args': args},
           f'{args.weights_dir}/FastDepth_Final.pth')

I can't figure out why I get this error. args.gpu == True, and I'm moving the model, all the data, and the loss function to CUDA, yet somehow there is still a tensor on the CPU. Could anyone figure out what's wrong?

Thanks.

Recommended answer

There might be an issue with the device the parameters are on:

If you need to move a model to GPU via .cuda(), please do so before constructing the optimizer for it. Parameters of a model after .cuda() will be different objects from those before the call.
In general, you should make sure that the optimized parameters live in consistent locations when optimizers are constructed and used.
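Concretely, a minimal resume sketch following that guidance might look like the following (the model class and the learning-rate/momentum arguments are placeholders, since the original loading code isn't shown in the question): load the checkpoint, restore the model weights, move the model to the GPU before constructing the optimizer, and only then load the optimizer state. As an extra safety net, any restored optimizer state tensors that ended up on the CPU (e.g. SGD momentum buffers) can be moved to the model's device explicitly.

import torch

# Placeholder names: FastDepthModel, args.lr and args.momentum stand in for the
# asker's actual model class and hyperparameters, which are not shown above.
device = torch.device('cuda' if args.gpu and torch.cuda.is_available() else 'cpu')

checkpoint = torch.load(f'{args.weights_dir}/FastDepth_Final.pth', map_location=device)

model = FastDepthModel()
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)  # move the model *before* constructing the optimizer

optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

# Safety net: make sure any restored optimizer state tensors (e.g. SGD momentum
# buffers) live on the same device as the parameters they belong to.
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)

epoch = checkpoint['epoch']
batch = checkpoint['batch']
train(model, optimizer, train_loader, val_loader,
      criteria=args.criterion, epoch=epoch, batch=batch)

With this ordering, the optimizer is created from parameters that already live on the GPU, and the explicit loop guards against momentum buffers that were deserialized onto the CPU, which is what triggers the error in optimizer.step().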
