Detecting and recovering from Windows TDR?

Problem description

I've run into an odd issue with some OpenCL code I'm working on: once in a blue moon, Windows TDR (Timeout Detection and Recovery) kicks in and resets the GPU. The offending kernel runs for only 150 ms and runs thousands of times (over the course of many hours) before TDR kills it, so I'm certain that the kernel itself isn't to blame.

My concern is that once TDR kicks in, the kernel dies and the program is stuck in limbo indefinitely. From what I can tell, the call to clFinish never returns.

Is there a way to detect that a kernel has died so that it can be handled gracefully?

Recommended answer

I managed to come up with a solution, although it's far from optimal.

I modified the program so that the OpenCL processing is done in a separate thread, and created a global watchdog variable shared between the parent and child threads. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.

While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog timer. If the elapsed time exceeds a certain threshold, the program forcefully terminates itself without waiting for the child thread to return.
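The watchdog described above can be sketched roughly as follows. This is only an illustration in Python, not the original code: the names `run_with_watchdog`, `watchdog_ms`, `timeout_ms`, `poll_s`, and `on_timeout` are made up for the example, and `work` stands in for the part of the program that enqueues the kernel and blocks in clFinish(). The forced termination is assumed to be `os._exit(1)`, which ends the process immediately without waiting for the stuck thread.

```python
import os
import threading
import time

# Shared watchdog variable (hypothetical name): holds the start time in
# milliseconds while work is in flight, and 0 when no work is pending.
watchdog_ms = 0
watchdog_lock = threading.Lock()


def now_ms():
    # Monotonic clock in milliseconds; clamped so it never equals the 0 sentinel.
    return max(1, int(time.monotonic() * 1000))


def run_with_watchdog(work, timeout_ms, poll_s=0.01, on_timeout=None):
    """Run work() in a thread. Return True if it finished in time;
    otherwise call on_timeout() and return False."""
    global watchdog_ms
    if on_timeout is None:
        # Assumed default: hard exit with a positive return code.
        on_timeout = lambda: os._exit(1)

    def worker():
        global watchdog_ms
        work()  # stands in for the OpenCL processing and clFinish()
        with watchdog_lock:
            watchdog_ms = 0  # processing finished: reset the watchdog

    with watchdog_lock:
        watchdog_ms = now_ms()  # parent arms the watchdog before spawning
    threading.Thread(target=worker, daemon=True).start()

    while True:
        with watchdog_lock:
            started = watchdog_ms
        if started == 0:
            return True  # worker reset the variable, so it completed
        if now_ms() - started > timeout_ms:
            on_timeout()  # give up without joining the stuck thread
            return False
        time.sleep(poll_s)
```

The `on_timeout` hook exists only so the sketch can be exercised without killing the interpreter; in a real program the default hard exit matches the "forcefully terminates itself" step, since a thread blocked inside a driver call generally cannot be cancelled cleanly.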

This solution works with or without Windows TDR enabled. If TDR is enabled and the driver resets, the call to clFinish() never returns, and the parent terminates once the watchdog timer trips. If TDR is not enabled, the runaway process freezes the display, but once the watchdog timer trips, the parent terminates processing, ending the freeze.

Now that I have a watchdog set up, I simply wrapped my program in a script: if it terminates in error (positive return code), the program is rerun.
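A wrapper of that kind might look something like the sketch below. The answer does not show the actual script, so everything here is an assumption: the function name `run_until_clean` is hypothetical, and the `max_restarts` cap is an addition to keep a permanently failing program from looping forever.

```python
import subprocess
import sys


def run_until_clean(cmd, max_restarts=5):
    """Run cmd, rerunning it while it exits with a positive return code.
    Returns the return code of the last run."""
    code = None
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        if code <= 0:
            break  # no positive error code: stop rerunning
        # Positive return code, e.g. the watchdog forced an exit.
        print(f"run {attempt + 1} exited with code {code}", file=sys.stderr)
    return code
```

It would be called with the program's command line as a list, for example `run_until_clean(["my_opencl_app.exe"])`, where the executable name is a placeholder.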
