Resetting GPU and driver after CUDA error


Question

Sometimes bugs in my CUDA programs break the desktop graphics (on Windows). Typically the screen stays somewhat readable, but whenever the graphics change, for example when I drag a window, lots of semi-random colored pixels and small blocks appear.

I have tried to reset the GPU and driver by changing the desktop resolution, but that doesn't help. The only fix I have found is to reboot the computer.

Is there a program or some trick I can use to make the driver and GPU reset without rebooting?

Background:

I have had compute capability 1.0, 1.1, 1.3 and 2.0 cards, but I only have a 1.1 and a 2.0 card now. I've seen the issue on 1.0 and 1.1, and I'm pretty sure I've seen it on 1.3. I'm unsure about 2.0. Did memory protection get added some time around 1.3? I am almost sure it's not due to unstable hardware, because the problems seem to have been triggered by bugs in my code and disappeared once the bugs were fixed. When running finished code, the cards have been stable. I wrote this question after seeing the issue on my 1.1 card, but it disappeared after I fixed a bug, and now I don't have any code that reproduces it. Maybe I should try writing to random locations on the 1.1 card and see if anything happens...

Recommended answer

If you are on Tesla hardware on Linux and can run nvidia-smi, then you can reset the GPU using

nvidia-smi -r

or

nvidia-smi --gpu-reset

Here is the man output for this switch:


Resets GPU state. Can be used to clear double bit ECC errors or recover hung GPU. Requires -i switch to target specific device. Available on Linux only.
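Because the man text says the reset needs the -i switch, the full command combines both flags. The following is a minimal sketch of that, assuming a POSIX shell, sufficient privileges (typically root), and a driver whose nvidia-smi supports --gpu-reset on the installed card. The function name reset_gpu is hypothetical and only for illustration.

```shell
#!/bin/sh
# Sketch only: reset_gpu is a hypothetical helper, not part of nvidia-smi.
# Assumes Linux, a GPU that supports reset, and enough privileges to do it.
reset_gpu() {
    idx="$1"

    # Accept only a non-empty, all-digit GPU index (e.g. 0, 1, 2).
    case "$idx" in
        ''|*[!0-9]*)
            echo "usage: reset_gpu <gpu-index>" >&2
            return 2
            ;;
    esac

    # Fail early with a clear message if the tool is not installed.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 127
    fi

    # -i selects the target device, as the man text above requires.
    nvidia-smi --gpu-reset -i "$idx"
}
```

To reset the first GPU, one would call reset_gpu 0. The reset is expected to fail if other processes are still using the device, so stop them first.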

Otherwise...

The way to truly reset the hardware is to reboot.

What you describe shouldn't happen. I recommend testing with different hardware and letting us know if it still occurs.
