"RuntimeError:CUDA错误:设备端断言已触发"是什么意思?在PyTorch中是什么意思? [英] What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?

Question

I have seen many posts about particular, case-specific problems, but no fundamental explanation of the error itself. What does this error:

RuntimeError: CUDA error: device-side assert triggered

mean? Specifically, what is the assert that is being triggered, why is the assert there, and how do we work backwards to debug the problem?

As-is, this error message is nearly useless for diagnosing any problem, because all it seems to say is that "some code somewhere that touches the GPU" has a problem. The CUDA documentation does not seem helpful in this regard either, though I could be wrong. https://docs.nvidia.com/cuda/cuda-gdb/index.html

Recommended answer

When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. attempt to dereference an invalid pointer) but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() occurs in device code, and the assert condition is false.
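As an illustration only: in PyTorch code, one commonly reported source of a device-side assert is an index outside the valid range, such as a class label that is not smaller than the number of classes, or a token id beyond the size of an embedding table. That is an assumption about the typical case, not something the error message itself tells you. The sketch below is a plain-Python, host-side check of the same kind of condition; `find_out_of_range` and the sample values are hypothetical and not part of PyTorch.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Hypothetical example: 3 classes, so the valid labels are 0, 1 and 2.
labels = [0, 2, 3, 1, -1]
bad = find_out_of_range(labels, size=3)
if bad:
    print("invalid labels (position, value):", bad)
```

Running a check like this on the host before the data reaches the GPU turns an opaque device-side failure into an ordinary, readable message.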

Such an error occurs as a result of a specific kernel. Runtime error checking in CUDA is necessarily asynchronous, but there are probably at least 3 possible methods to start to debug this.

  1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error-checking after each kernel launch. This will identify the specific kernel that has caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below.

  2. Run your code with cuda-memcheck. This tool is something like "valgrind for device code". Code run under cuda-memcheck tends to run much more slowly, but the runtime error reporting is enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, as well as the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error instead of assert(), but the process with assert() will be similar).

  3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), the debugger's backtrace will indicate which line the assert is on when it is hit.
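For PyTorch users, method 1 above usually does not require editing any kernel source. The CUDA runtime reads the environment variable `CUDA_LAUNCH_BLOCKING`, and setting it to `1` makes kernel launches synchronous, so the Python traceback should point at the operation that actually failed. A minimal sketch, assuming the variable is set before CUDA is initialized (setting it before importing `torch` is the conservative choice); `enable_blocking_launches` is a hypothetical helper name.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so that kernel launches run synchronously."""
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the script, before CUDA is initialized.
enable_blocking_launches()

# import torch   # assumption: torch is imported only after the variable is set
# ... run the failing code here; the traceback should now name the failing op
```

The same effect can be had without touching the script by setting the variable in the shell that launches it. Synchronous launches slow execution down, so this is a debugging setting rather than something to leave on.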

Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a Python script.
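A sketch of how such a launch might be assembled, building the command line in Python. These tools take the program to run as their argument, so the Python interpreter becomes the program and the script becomes its argument. `train.py` is a hypothetical script name and `build_debug_command` is a helper introduced here for illustration. Newer CUDA toolkits ship compute-sanitizer as the successor to cuda-memcheck; the same pattern should apply, but check which tool your installation provides.

```python
import sys


def build_debug_command(tool, script, script_args=()):
    """Build an argv list that runs a Python script under a CUDA debugging tool."""
    python = sys.executable or "python"
    if tool == "cuda-gdb":
        # gdb-style debuggers need --args to pass arguments to the debuggee.
        return [tool, "--args", python, script, *script_args]
    # cuda-memcheck style: tool [options] program [program arguments]
    return [tool, python, script, *script_args]


cmd = build_debug_command("cuda-memcheck", "train.py")  # train.py is hypothetical
print(" ".join(cmd))
# To launch it, pass the list to subprocess.run(cmd) on a machine with the tool installed.
```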

At this point you have discovered what the assert is and where in the source code it is. Why it is there cannot be answered generically. That depends on the developer's intention, and if it is not commented or otherwise obvious, you will need some way to work it out. The question of "how to work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert, and inspect machine state, e.g. variables, when the assert is about to be hit).
