CUDA uncorrectable ECC error encountered

Problem description

My environment is:

  • Windows 7 x64
  • MATLAB 2012a x64
  • CUDA SDK 4.2
  • Tesla C2050 GPU

I am having trouble figuring out why my GPU is crashing with the error "uncorrectable ECC error encountered". The error only occurs when I use 512 or more threads. I can't post the kernel, but I will try to describe what it does.

In general, the kernel takes a number of parameters and produces two complex matrices whose size is defined by the thread count, M, and another number, N, so the returned matrices are of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the numbers are 256x256.

Each thread extracts a 999-element vector from a 2D array of size 999xM, selected by the thread id, then cycles through the second dimension (0 .. N-1) of the output matrices to do the calculation. A number of intermediate parameters are calculated, using only pow, sin and cos in addition to the + - * / operators. To calculate one of the output matrices, an additional loop sums up the contributions of the 999-element vector that was extracted earlier. This loop does some intermediate calculations to determine the range of values that are allowed to contribute. Each contribution is then scaled by a factor determined by the cos and sin of a calculated fractional value. This is where it crashes. If I substitute a constant value (1.0, or any other for that matter), the kernel executes without trouble. However, when even one of the calls (cos or sin) is included, the kernel crashes.

Some pseudocode follows:

kernel()
{

/* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
for (int i = 0; i < 999; i++)
{
    .....
}

/* Cycle through the 2nd dimension of the output matrices */
for (int j = 0; j < N; j++)
{
    /* Calculate some intermediate variables */

    /* Calculate the real and imaginary components of the first output matrix */
    /* real = cos(value), imaginary = sin(value) */

    /* Construct the first output matrix from some intermediate variables and the real and imaginary components */

    /* Calculate some more intermediate variables */

    /* cycle through the extracted vector (0 .. 998) */
    for (int k = 0; k < 999; k++)
    {

        /* Calculate some more intermediate variables */

        /* Determine the range of allowed values to contribute to the second output matrix. */

        /* Calculate the real and imaginary components of the second output matrix */
        /* real = cos(value), imaginary = sin(value) */
        /* This is where it crashes, unless real and imaginary are constant values (1.0) */

        /* Sum up the contributions of the extracted vector to the second output matrix */

    }
    /* Construct the second output matrix from some intermediate variables and the real and imaginary components */

}
}

I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case: with 512 threads I am using fewer than the 32,768 available registers. Can anyone suggest what the cause could be?

Here is the ptxas info:

ptxas info    : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info    : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
    8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __internal_trig_reduction_slowpathd
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
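
The ptxas figures can be checked with a little arithmetic. The sketch below is host-side C, not part of the original kernel, and its helper names are hypothetical. It assumes the 32,768-register figure quoted in the question and assumes the extracted vector is stored as doubles, which the double-pointer arguments in the mangled kernel name suggest but do not confirm.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; none of these names
   come from the original kernel or from the CUDA toolkit. */

/* Registers needed by one block: registers per thread times threads per block. */
unsigned long registers_per_block(unsigned regs_per_thread,
                                  unsigned threads_per_block)
{
    return (unsigned long)regs_per_thread * threads_per_block;
}

/* Bytes needed to hold one per-thread vector of doubles
   (assumption: the 999-element vector is stored as double). */
size_t vector_bytes(size_t elements)
{
    return elements * sizeof(double);
}

/* Total stack-frame bytes for all threads of one block. */
unsigned long stack_bytes_per_block(unsigned long stack_frame_bytes,
                                    unsigned threads_per_block)
{
    return stack_frame_bytes * threads_per_block;
}
```

With the values reported above, registers_per_block(53, 512) is 27,136, which is indeed below 32,768. vector_bytes(999) is 7,992 on platforms with 8-byte doubles, which accounts for most of the 8,056-byte stack frame. That suggests, without proving it, that each thread keeps its extracted vector in per-thread local memory, which is backed by device memory rather than registers. stack_bytes_per_block(8056, 512) comes to 4,124,672 bytes for a 512-thread block.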

Recommended answer

"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means of detecting and correcting errors in bits stored in RAM. A stray cosmic ray can flip a bit stored in RAM once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage wrong: too many for the ECC to recover the original bit values.
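
To make the difference between a correctable and an uncorrectable error concrete, here is a toy single-error-correct, double-error-detect (SECDED) code in C. It is only a sketch under stated assumptions: it protects 4 data bits with an extended Hamming code, whereas real ECC memory protects much wider words with its own encoding, and all names here are hypothetical.

```c
#include <stdint.h>

/* Toy SECDED code (extended Hamming, 4 data bits in an 8-bit word).
   Illustrative only; this is not the encoding a GPU actually uses. */

enum ecc_status { ECC_OK = 0, ECC_CORRECTED = 1, ECC_UNCORRECTABLE = 2 };

/* XOR of all 8 bits of v. */
int parity8(uint8_t v)
{
    v ^= (uint8_t)(v >> 4);
    v ^= (uint8_t)(v >> 2);
    v ^= (uint8_t)(v >> 1);
    return v & 1;
}

/* Bit i of the code word is Hamming position i; bit 0 is overall parity.
   Data bits sit at positions 3, 5, 6, 7; check bits at 1, 2, 4. */
uint8_t secded_encode(uint8_t data)
{
    uint8_t d0 = data & 1, d1 = (data >> 1) & 1;
    uint8_t d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t word = (uint8_t)((d0 << 3) | (d1 << 5) | (d2 << 6) | (d3 << 7));
    word |= (uint8_t)((d0 ^ d1 ^ d3) << 1);  /* covers positions 3, 5, 7 */
    word |= (uint8_t)((d0 ^ d2 ^ d3) << 2);  /* covers positions 3, 6, 7 */
    word |= (uint8_t)((d1 ^ d2 ^ d3) << 4);  /* covers positions 5, 6, 7 */
    word |= (uint8_t)parity8(word);          /* make the total parity even */
    return word;
}

/* Decode a possibly corrupted word. On ECC_OK or ECC_CORRECTED the
   recovered data is written to *data_out. */
enum ecc_status secded_decode(uint8_t word, uint8_t *data_out)
{
    int pos, syndrome = 0;
    enum ecc_status status = ECC_OK;

    /* The syndrome is the XOR of the positions of all set bits;
       it is 0 for a valid code word. */
    for (pos = 1; pos < 8; pos++)
        if ((word >> pos) & 1)
            syndrome ^= pos;

    if (parity8(word)) {
        /* Odd parity: assume one flipped bit, located by the syndrome
           (syndrome 0 means the overall parity bit itself). */
        word ^= (uint8_t)(1u << syndrome);
        status = ECC_CORRECTED;
    } else if (syndrome != 0) {
        /* Even parity but a nonzero syndrome: two bits flipped. The
           error is detected, but its location cannot be determined. */
        return ECC_UNCORRECTABLE;
    }

    *data_out = (uint8_t)(((word >> 3) & 1) | (((word >> 5) & 1) << 1) |
                          (((word >> 6) & 1) << 2) | (((word >> 7) & 1) << 3));
    return status;
}
```

Flipping any one bit of an encoded word makes secded_decode return ECC_CORRECTED with the original data restored. Flipping any two bits makes it return ECC_UNCORRECTABLE, which is the toy analogue of the error reported in the question.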

This could mean that you have a bad or marginal RAM cell in your GPU device memory.

Marginal circuits of any kind may not fail 100% of the time, but they are more likely to fail under the stress of heavy use and the associated rise in temperature.

There are diagnostic utilities floating around that stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an equivalent for testing the device RAM banks of a GPU.
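
For readers unfamiliar with what such a stress test does, the core idea is a write-then-verify loop over complementary bit patterns. The C sketch below runs on an ordinary host buffer and uses hypothetical function names. A GPU version would have to allocate device memory and run the same loop there, which is not shown; real utilities also use many more patterns and repeated passes.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal write-then-verify memory pattern test (host memory only).
   Illustrative sketch; all names are hypothetical. */

/* Fill the buffer with one 32-bit pattern. */
void fill_pattern(volatile uint32_t *buf, size_t words, uint32_t pattern)
{
    size_t i;
    for (i = 0; i < words; i++)
        buf[i] = pattern;
}

/* Count the words that no longer hold the expected pattern. */
size_t count_mismatches(const volatile uint32_t *buf, size_t words,
                        uint32_t pattern)
{
    size_t i, bad = 0;
    for (i = 0; i < words; i++)
        if (buf[i] != pattern)
            bad++;
    return bad;
}

/* Run several complementary patterns so every bit is tested as both
   0 and 1; return the total number of mismatching words. */
size_t pattern_test(volatile uint32_t *buf, size_t words)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    size_t p, bad = 0;
    for (p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        fill_pattern(buf, words, patterns[p]);
        bad += count_mismatches(buf, words, patterns[p]);
    }
    return bad;
}
```

On healthy memory pattern_test returns 0. A nonzero count that appears only after long runs or at high temperature would be consistent with the marginal-cell explanation above.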

If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.
