Calculation on GPU leads to driver error "stopped responding"

Question

I have this little nonsense script here which I am executing in MATLAB R2013b:

clear all;

n = 2000;
times = 50;
i = 0;

tCPU = tic;

disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end

tCPU = toc(tCPU);
tGPU = tic;

disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU =  A * B ; 
end
tGPU = toc(tGPU);

fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);

Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered."

I assume this means that Windows did not get a response from my graphics card's driver, or something like that. The script itself returned without errors:

>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec

But regardless of whether the GPU runs out of memory, MATLAB cannot use the GPU device again until I restart MATLAB. If I don't restart MATLAB, I just get this message from CUDA:

>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated

Error in test (line 21)
A = gpuArray(A);

Does anybody know how to avoid this issue or what I am doing wrong here?

If needed, my GPU Device:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'GeForce GTX 660M'
                     Index: 1
         ComputeCapability: '3.0'
            SupportsDouble: 1
             DriverVersion: 6
            ToolkitVersion: 5
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+09
                FreeMemory: 1.9037e+09
       MultiprocessorCount: 2
              ClockRateKHz: 950000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

Recommended answer

The key piece of information is this part of the gpuDevice output:

KernelExecutionTimeout: 1

This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from rendering the machine unresponsive by freezing the display. The runtime of your MATLAB script is clearly exceeding the display driver watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset on the device, which I guess will run cudaDeviceReset() under the hood.
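As a rough sketch of that last suggestion, the snippet below checks the KernelExecutionTimeout flag and then calls reset on the currently selected device. It assumes the Parallel Computing Toolbox is installed and that the device can still be queried after the timeout, which is not guaranteed. The variable name dev and the small test matrix are illustrative only. Any gpuArray variables created before the reset become invalid and have to be transferred again.

```matlab
% Sketch: try to recover the GPU after a watchdog timeout without restarting MATLAB.
% Assumption: gpuDevice can still be queried once the timeout has occurred.
dev = gpuDevice;              % handle to the currently selected device
if dev.KernelExecutionTimeout
    disp('Watchdog timer is active on this GPU.');
end
reset(dev);                   % reinitialise the device; old gpuArray variables become invalid
A = gpuArray(rand(4));        % data has to be sent to the device again after the reset
```

If the reset does not bring the device back, restarting MATLAB, as described in the question, remains the fallback.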

There is a lot of information about this watchdog timer on the web, for example this Stack Overflow question. How to modify the timeout depends on your OS and hardware. The simplest way to avoid the problem is to not run CUDA code on a display GPU, or to break your compute jobs into smaller pieces so that no single operation runs longer than the timeout limit. Or just write faster code...
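One way to read "smaller pieces" is sketched here: the product is computed one block of columns at a time, and wait is called on the device after each block so that work does not pile up in the queue. The block width blockCols is a hypothetical tuning value, not something from the question or the answer. Whether this keeps each launch under the watchdog limit depends on the hardware and driver, so treat it as a starting point rather than a guaranteed fix.

```matlab
% Sketch only: blockCols is an assumed value chosen for illustration.
n = 2000;
blockCols = 250;                 % columns of B handled per step (hypothetical)

A = gpuArray(rand(n, n));
B = gpuArray(rand(n, n));
C = gpuArray(zeros(n, n));       % preallocate the result on the device
dev = gpuDevice;                 % currently selected device (no reset)

for first = 1:blockCols:n
    last = min(first + blockCols - 1, n);
    % Each product covers only a slice of B, so each step does less work.
    C(:, first:last) = A * B(:, first:last);
    wait(dev);                   % let this step finish before queuing the next
end
```

The same wait(dev) call placed before toc would also make the GPU timing in the question's script reflect completed work rather than only queued work.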
