What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?

Question

I have seen many posts about particular, case-specific instances of this problem, but no fundamental explanation. What does this error:

RuntimeError: CUDA error: device-side assert triggered

mean? Specifically, what is the assert that is being triggered, why is the assert there, and how do we work backwards to debug the problem?

As-is, this error message is nearly useless for diagnosing a problem, because all it seems to say is that "some code somewhere that touches the GPU" has a problem. The CUDA documentation also does not seem helpful in this regard, though I could be wrong. https://docs.nvidia.com/cuda/cuda-gdb/index.html

Recommended answer

When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The most common error detected in device code is something like an illegal address (e.g. an attempt to dereference an invalid pointer), but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() in device code is executed and its condition is false.

Such an error is caused by a specific kernel. Because kernel launches are asynchronous, the error may be reported at a later API call than the launch that caused it, but there are at least three ways to start debugging it:

  1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error-checking after each kernel launch. This will identify the specific kernel that has caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below.

  2. Run your code with cuda-memcheck. This is a tool something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, as well as the assert itself and the condition that was false. The original answer links to a walkthrough of using it (albeit with an illegal address error instead of assert(), but the process with assert() is similar).

  3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), its backtrace will indicate which line the assert is on when it is hit.
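
For code driven from PyTorch, method 1 can be approximated without touching kernel source. The following is a minimal sketch, not the answer's own recipe: it sets the CUDA environment variable CUDA_LAUNCH_BLOCKING before any CUDA work starts, and wraps each step so that a device-side error is reported next to a step name. The helper run_checked and the step names are hypothetical; passing torch.cuda.synchronize as the sync callable is an assumption about how you would use it on a GPU machine.

```python
import os

# The variable is read when CUDA is initialized, so set it before the first
# CUDA call in the process (to be safe: before importing torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def run_checked(name, fn, *args, sync=None):
    """Run one step, then block until the GPU is idle so errors surface here.

    `sync` is any zero-argument callable that waits for outstanding GPU work,
    for example torch.cuda.synchronize (assumption). Leave it as None on a
    CPU-only run. `name` is a hypothetical label used only in the message.
    """
    try:
        result = fn(*args)
        if sync is not None:
            sync()
    except RuntimeError as err:
        # Re-raise with the step name so the failing launch is easy to locate.
        raise RuntimeError(f"error surfaced during step {name!r}: {err}") from err
    return result


if __name__ == "__main__":
    # CPU-only demonstration: the step succeeds and its result is returned.
    print(run_checked("add", lambda a, b: a + b, 1, 2))
```

On a GPU run you would call something like run_checked("forward", model, batch, sync=torch.cuda.synchronize) for each suspicious step; the first step that raises is the one whose kernel hit the assert.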

Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a Python script.
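
As a rough illustration of what launching from Python looks like, the sketch below builds the command lines that wrap the Python interpreter in each tool. The script name train.py and the helper debug_commands are hypothetical. compute-sanitizer is included because it is the successor to cuda-memcheck in newer CUDA toolkits; which tools are actually installed depends on your toolkit version.

```python
import shlex
import sys


def debug_commands(script, script_args=()):
    """Return argv lists that run `script` under each CUDA debugging tool."""
    target = [sys.executable, script, *script_args]
    return {
        # The tool wraps the whole Python process, so kernels launched by
        # PyTorch inside it are checked as well.
        "cuda-memcheck": ["cuda-memcheck", *target],
        "compute-sanitizer": ["compute-sanitizer", *target],
        # cuda-gdb follows gdb's convention: --args passes the program and
        # its arguments through to the debuggee.
        "cuda-gdb": ["cuda-gdb", "--args", *target],
    }


if __name__ == "__main__":
    # "train.py" is a placeholder for your own script.
    for tool, argv in debug_commands("train.py").items():
        print(f"{tool}: {shlex.join(argv)}")
```

Each printed line can be pasted into a shell, or the argv list can be handed to subprocess.run.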

At this point you have discovered what the assert is and where in the source code it is. Why it is there cannot be answered generically. That depends on the developer's intention, and if the assert is not commented or otherwise obvious, you will need to infer its purpose from the surrounding code. The question of "how to work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, or a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert and inspect machine state, e.g. variables, when the assert is about to be hit).
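
Once the assert's condition is known, one way to work backwards is to re-check the same condition on the host before the kernel is launched. The sketch below assumes the failing assert is a bounds check on an index (a common pattern, but an assumption here, not something the answer states). The helper find_out_of_range and the sample data are hypothetical.

```python
def find_out_of_range(indices, upper_bound):
    """Return (position, value) pairs for entries outside [0, upper_bound)."""
    return [
        (pos, val)
        for pos, val in enumerate(indices)
        if not 0 <= val < upper_bound
    ]


if __name__ == "__main__":
    # Hypothetical class labels for a 4-class problem; 5 and -1 are invalid.
    labels = [0, 3, 1, 5, -1]
    num_classes = 4
    print(find_out_of_range(labels, num_classes))  # prints [(3, 5), (4, -1)]
```

With PyTorch tensors, the same check can be applied to tensor.tolist() on a small batch, which points at the offending input without needing the GPU at all.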
