CUDA apps time out & fail after several seconds - how to work around this?

Question

I've noticed that CUDA applications tend to run for at most roughly 5-15 seconds before they fail and exit. I realize that ideally a CUDA application should not run that long, but assuming CUDA is the right choice and the amount of sequential work per thread means it must run that long, is there any way to extend this time limit or get around it?

Recommended answer

I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.

You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious. To disable it, use regedit to create HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck as a REG_DWORD and set it to 1. You may also need to change something in the NVIDIA control panel. Look for references to "VPU Recovery" in the CUDA docs.
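Before touching any registry setting, it may be worth confirming that a run-time limit actually applies to the device in question. The following is a minimal sketch, assuming the CUDA runtime API and its `cudaDevAttrKernelExecTimeout` device attribute; the helper name `watchdogEnabled` is made up for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if kernels on `dev` are subject to a
// run-time limit, 0 if not, and -1 if the query itself failed.
int watchdogEnabled(int dev) {
    int value = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&value, cudaDevAttrKernelExecTimeout, dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return value ? 1 : 0;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    // Report the setting for every visible device.
    for (int dev = 0; dev < count; ++dev) {
        std::printf("device %d: kernel run-time limit = %d\n",
                    dev, watchdogEnabled(dev));
    }
    return 0;
}
```

A device that drives a display typically reports the limit as enabled, while a dedicated compute device may not, so the result can differ between GPUs in the same machine.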

Ideally, you should be able to split your kernel operations into multiple passes over your data, so that each pass runs within the time limit.

Alternatively, you can divide the problem domain so that each command computes fewer output pixels. That is, instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU that compute 100,000 each.
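The split described above could be sketched in CUDA roughly as follows. This is only an illustration under assumptions: the kernel `computeChunk`, the helper `computeInChunks`, and the per-element work (doubling the index) are placeholders, not anything from the original answer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel: computes `count` outputs starting at `offset`.
__global__ void computeChunk(float* out, size_t offset, size_t count) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        size_t idx = offset + i;
        out[idx] = 2.0f * (float)idx;  // stand-in for the real work
    }
}

// Launches one kernel per chunk and waits for each one to finish.
// Returns the number of launches, or -1 on any CUDA error.
int computeInChunks(float* d_out, size_t total, size_t chunk) {
    const unsigned int threads = 256;
    int launches = 0;
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t count = (total - offset < chunk) ? (total - offset) : chunk;
        unsigned int blocks = (unsigned int)((count + threads - 1) / threads);
        computeChunk<<<blocks, threads>>>(d_out, offset, count);
        if (cudaGetLastError() != cudaSuccess) return -1;
        // Synchronize so that each chunk completes as its own short unit.
        if (cudaDeviceSynchronize() != cudaSuccess) return -1;
        ++launches;
    }
    return launches;
}

int main() {
    const size_t total = 1000000, chunk = 100000;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, total * sizeof(float)) != cudaSuccess) return 1;

    int launches = computeInChunks(d_out, total, chunk);

    // Copy the full result back once, after all chunks are done.
    std::vector<float> h_out(total);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, total * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (launches < 0 || err != cudaSuccess) return 1;

    std::printf("launches: %d, last value: %f\n", launches, h_out[total - 1]);
    return 0;
}
```

The chunk size is a tuning knob: smaller chunks keep each launch further below the time limit at the cost of more launch overhead.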

The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?

You should not have to copy all of your data back and forth across the PCIe bus on every time slice; you can leave your textures, etc. in GPU local memory. You just need some command buffers to complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
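One way this might look in CUDA is sketched below. It is an assumption, not an exact equivalent of CtxFlush(): each time slice ends at a kernel launch boundary followed by `cudaDeviceSynchronize()`, and the working state stays in device memory between launches. The kernel `advance`, the helper `runPasses`, and the update rule are hypothetical placeholders for a long sequential computation per thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel: advances each element by `steps` sequential iterations,
// reading and writing state that stays resident in GPU memory.
__global__ void advance(float* state, size_t n, int steps) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = state[i];
    for (int s = 0; s < steps; ++s) {
        x = 0.5f * x + 1.0f;  // stand-in for one step of the real work
    }
    state[i] = x;
}

// Runs `totalSteps` iterations as several short launches of at most
// `stepsPerPass` each. Returns the number of passes, or -1 on error.
int runPasses(float* d_state, size_t n, int totalSteps, int stepsPerPass) {
    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    int passes = 0;
    for (int done = 0; done < totalSteps; done += stepsPerPass) {
        int remaining = totalSteps - done;
        int steps = (remaining < stepsPerPass) ? remaining : stepsPerPass;
        advance<<<blocks, threads>>>(d_state, n, steps);
        if (cudaGetLastError() != cudaSuccess) return -1;
        // Let this pass finish before submitting the next one.
        // No data is copied back to the host here.
        if (cudaDeviceSynchronize() != cudaSuccess) return -1;
        ++passes;
    }
    return passes;
}

int main() {
    const size_t n = 1 << 20;
    std::vector<float> h_state(n, 0.0f);
    float* d_state = nullptr;
    if (cudaMalloc(&d_state, n * sizeof(float)) != cudaSuccess) return 1;

    // Upload once, run all passes on the device, download once.
    cudaMemcpy(d_state, h_state.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    int passes = runPasses(d_state, n, 1000, 100);
    cudaError_t err = cudaMemcpy(h_state.data(), d_state, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    if (passes < 0 || err != cudaSuccess) return 1;

    std::printf("passes: %d, state[0]: %f\n", passes, h_state[0]);
    return 0;
}
```

Only the two `cudaMemcpy` calls in `main` cross the bus; the loop in `runPasses` touches device memory alone, which is the point the paragraph above makes.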

Finally, GPUs are fast, so if your application can't do useful work within those 5 or 10 seconds, I'd take that as a sign that something is wrong.

(Outdated again; see the updates below for the most recent information.) The registry key above is out of date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx

This is getting really out of date. The easiest way to disable TDR for CUDA programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This changes the registry setting for you. Close the tool and reboot; no change to the TDR registry settings takes effect until you reboot.

Although the NVIDIA tools now allow disabling TDR, the same question is still relevant for AMD/OpenCL developers. For them, the current documentation of the TDR settings is at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
