Pytorch: RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered


Problem Description


I am running into the following error when trying to train this on this dataset.


Since this is the configuration published in the paper, I am assuming I am doing something incredibly wrong.


This error arrives on a different image every time I try to run training.

C:/w/1/s/windows/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Noam/Code/vision_course/hopenet/deep-head-pose/code/original_code_augmented/train_hopenet_with_validation_holdout.py", line 187, in <module>
    loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\functional.py", line 2204, in mse_loss
    ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered


Any ideas?

Recommended Answer


This kind of error generally occurs when using NLLLoss or CrossEntropyLoss and the dataset contains negative labels (or labels greater than or equal to the number of classes). That is also exactly the assertion you are hitting: Assertion t >= 0 && t < n_classes failed.


This won't occur for MSELoss, but the OP mentions that a CrossEntropyLoss is used elsewhere, and that is what actually triggers the assert; because CUDA errors are reported asynchronously, the program crashes on a different line (here, the MSELoss call). The solution is to clean the dataset and ensure that t >= 0 && t < n_classes is satisfied (where t is the label).
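A minimal sketch of such a sanity check, assuming the labels are available as an integer tensor (the variable names and the n_classes value below are hypothetical, not taken from the question's code):

import torch

# Hypothetical check: every label t must satisfy 0 <= t < n_classes
n_classes = 66                          # assumed number of classes for the model
labels = torch.tensor([3, 17, 65])      # stand-in for the dataset's integer labels

bad = (labels < 0) | (labels >= n_classes)
if bad.any():
    print("offending labels:", labels[bad])
    raise ValueError("dataset contains labels outside [0, n_classes)")

Because CUDA errors are reported asynchronously, running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 (or doing a dry run on the CPU) makes the traceback point at the line that actually triggered the assert.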


Also, make sure the network output matches what the loss function expects: BCELoss needs probabilities in the range 0 to 1 (apply a sigmoid), and NLLLoss needs log-probabilities (apply a log_softmax). Note that this is not required for CrossEntropyLoss or BCEWithLogitsLoss, because they apply the activation inside the loss function. (Thanks to @PouyaB for pointing this out.)
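A minimal sketch of these pairings (the shapes, class count, and tensors are illustrative, not taken from the question's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # raw outputs for 4 samples, 10 classes
targets = torch.randint(0, 10, (4,))        # integer labels in [0, 10)

# NLLLoss expects log-probabilities, so apply log_softmax first.
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# CrossEntropyLoss applies log_softmax internally and takes raw logits.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# BCELoss expects probabilities in [0, 1], so apply a sigmoid;
# BCEWithLogitsLoss takes raw logits directly.
binary_logits = torch.randn(4, 1)
binary_targets = torch.randint(0, 2, (4, 1)).float()
loss_bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)
loss_bcewl = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)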
