Calculation on GPU leads to driver error "stopped responding"

Question

I have this little nonsense script which I am executing in MATLAB R2013b:

clear all;

n = 2000;
times = 50;
i = 0;

tCPU = tic;

disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end

tCPU = toc(tCPU);
tGPU = tic;

disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU =  A * B ; 
end
tGPU = toc(tGPU);

fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);

Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered."

I assume this means that Windows did not get a response from my graphics card driver or something similar. The script returned without errors:

>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec

But regardless of whether the GPU runs out of memory, MATLAB is not able to use the GPU device again until I restart MATLAB. If I don't restart MATLAB, I just receive a message from CUDA:

>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT 
> In test at 1 
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated

Error in test (line 21)
A = gpuArray(A);

Does anybody know how to avoid this issue or what I am doing wrong here?

If needed, my GPU device:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'GeForce GTX 660M'
                     Index: 1
         ComputeCapability: '3.0'
            SupportsDouble: 1
             DriverVersion: 6
            ToolkitVersion: 5
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+09
                FreeMemory: 1.9037e+09
       MultiprocessorCount: 2
              ClockRateKHz: 950000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

Recommended answer

The key piece of information is this part of the gpuDevice output:

KernelExecutionTimeout: 1

This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from rendering the machine unresponsive by freezing the display. The runtime of your MATLAB script is clearly exceeding the display driver watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess will run cudaDeviceReset() under the hood.
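
As a rough sketch of that last suggestion (not part of the original answer), the snippet below checks the timeout flag and then tries to reset the device from within MATLAB. It assumes the Parallel Computing Toolbox functions gpuDevice and reset. Whether a reset actually recovers the device after a watchdog kill, rather than requiring a MATLAB restart, is not confirmed here.

```matlab
% Query the currently selected GPU (Parallel Computing Toolbox).
gpu = gpuDevice;

% KernelExecutionTimeout is 1 when the OS may kill long-running kernels,
% which is typically the case when the GPU also drives a display.
if gpu.KernelExecutionTimeout
    fprintf('%s: watchdog timer is active\n', gpu.Name);
else
    fprintf('%s: no kernel execution timeout\n', gpu.Name);
end

% Try to reinitialise the device. This invalidates all gpuArray data on it,
% so variables such as A and B must be transferred to the GPU again.
reset(gpu);
```

Note that reset throws away everything stored on the device, so it only makes sense between runs, not in the middle of a computation.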

There is a lot of information about this watchdog timer on the web, for example this Stack Overflow question. How to modify this timeout depends on your OS and hardware. The simplest way to avoid the problem is to not run CUDA code on a display GPU, or to make your compute jobs finer-grained so that no single operation has a runtime which exceeds the timeout limit. Or just write faster code...
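
One possible reading of "finer-grained jobs", sketched under assumptions: do the multiplication in column blocks and synchronise after each block with wait, so that the driver is never handed a long queue of work in one go. The block size of 500 is an arbitrary, hypothetical value, and whether this is enough to stay under the watchdog limit on a given driver and GPU is untested.

```matlab
n = 2000;
blockSize = 500;              % hypothetical value, tune for your GPU

gpu = gpuDevice;              % currently selected device
A = gpuArray(rand(n, n));
B = gpuArray(rand(n, n));
C = gpuArray(zeros(n, n));    % result matrix, stored on the GPU

% Compute C = A * B one block of columns at a time.
for firstCol = 1:blockSize:n
    lastCol = min(firstCol + blockSize - 1, n);
    C(:, firstCol:lastCol) = A * B(:, firstCol:lastCol);
    wait(gpu);                % block until the GPU has finished this block
end

result = gather(C);           % copy the result back to host memory
```

Smaller blocks trade some throughput for shorter individual GPU operations, so the block size is a compromise between speed and staying responsive.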
