CUDA uncorrectable ECC error encountered

Question

My environment is:


  • Windows 7 x64
  • Matlab 2012a x64
  • Cuda SDK 4.2
  • Tesla C2050 GPU

I am having trouble figuring out why my GPU is crashing with "uncorrectable ECC error encountered". The error only occurs when I use 512 threads or more. I can't post the kernel, but I will try to describe what it does.

In general, the kernel takes a number of parameters and produces two complex matrices whose size is defined by the thread count, M, and another number, N, so the returned matrices are of size MxN. A typical configuration is 512x512, but the two numbers are independent and can vary up or down. The kernel works when the numbers are 256x256.

Each thread extracts a 999-element vector from a 2D array of size 999xM, selected by its thread ID, then cycles through its row (0 .. N-1) of the output matrices to do the calculation. A number of intermediate parameters are calculated using only pow, sin and cos along with the + - * / operators. To calculate one of the output matrices, an additional loop is executed to sum up the contributions of the 999-element vector that was extracted earlier. This loop does some intermediate calculations to determine the range of values that are allowed to contribute. Each contribution is then scaled by a factor determined by the cos and sin of a calculated fractional value. This is where it crashes. If I substitute a constant value of 1.0, or any other constant for that matter, the kernel executes without trouble. However, when even just one of the calls (cos or sin) is included, the kernel crashes.

Some pseudocode follows:

kernel()
{

/* Extract 999 vector from 2D array 999xM - one 999 vector for each thread. */
for (int i = 0; i < 999; i++)
{
    .....
}

/* Cycle through the 2nd dimension of the output matrices */
for (int j = 0; j < N; j++)
{
    /* Calculate some intermediate variables */

    /* Calculate the real and imaginary components of the first output matrix */
    /* real = cos(value), imaginary = sin(value) */

    /* Construct the first output matrix from some intermediate variables and the real and imaginary components */

    /* Calculate some more intermediate variables */

    /* cycle through the extracted vector (0 .. 998) */
    for (int k = 0; k < 999; k++)
    {

        /* Calculate some more intermediate variables */

        /* Determine the range of allowed values to contribute to the second output matrix. */

        /* Calculate the real and imaginary components of the second output matrix */
        /* real = cos(value), imaginary = sin(value) */
        /* This is where it crashes, unless real and imaginary are constant values (1.0) */

        /* Sum up the contributions of the extracted vector to the second output matrix */

    }
    /* Construct the second output matrix from some intermediate variables and the real and imaginary components */

}
}

I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case: with 512 threads I'm using fewer than the 32,768 available registers. Can anyone suggest what the cause of this could be?

Here is the ptxas info:

ptxas info    : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info    : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
    8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __internal_trig_reduction_slowpathd
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
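As a rough sanity check on the figures quoted in the question, the per-block resource arithmetic can be sketched in plain host-side C++. The function names are hypothetical, and the register calculation ignores the hardware's allocation granularity, so treat it as an estimate rather than what the occupancy calculator actually computes. The 8056-byte stack frame is close to 999 x 8 = 7992 bytes, which would be consistent with the 999-element per-thread vector being held in local memory as doubles; that is an inference from the ptxas output, not something the question states.

```cpp
#include <cassert>
#include <cstddef>

// Registers needed by one thread block. Ignores the hardware's register
// allocation granularity, which is a simplifying assumption.
inline int registers_per_block(int regs_per_thread, int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_thread * threads_per_block;
}

// True if one block's registers fit in a register file of the given size.
inline bool fits_register_file(int regs_per_thread, int threads_per_block,
                               int register_file_size) {
    return registers_per_block(regs_per_thread, threads_per_block)
           <= register_file_size;
}

// Total per-thread stack (local memory) bytes touched by one block.
inline std::size_t stack_bytes_per_block(std::size_t frame_bytes,
                                         int threads_per_block) {
    assert(threads_per_block > 0);
    return frame_bytes * static_cast<std::size_t>(threads_per_block);
}
```

With the question's numbers, registers_per_block(53, 512) gives 27,136, which is below 32,768 and agrees with the occupancy calculator's verdict. stack_bytes_per_block(8056, 512) gives 4,124,672 bytes, roughly 3.9 MiB of device memory used for stack frames by a single 512-thread block, about twice what a 256-thread block uses.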


Recommended answer

"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means of detecting and correcting errors in the bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage wrong, too many for the ECC to recover the original bit values.

This could mean that you have a bad or marginal RAM cell in your GPU device memory.
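To make the difference between a correctable and an uncorrectable error concrete, here is a toy sketch of a single-error-correcting, double-error-detecting code on a 4-bit payload (an extended Hamming code). It is only an illustration: the names ecc_encode and ecc_decode are hypothetical, and the ECC in real device memory protects much wider words, but the principle of "one flipped bit can be repaired, two can only be detected" is the same.

```cpp
#include <cassert>
#include <cstdint>

enum class EccStatus { Ok, Corrected, Uncorrectable };

struct DecodeResult {
    EccStatus status;
    std::uint8_t data;  // 4-bit payload; meaningless when Uncorrectable
};

// Returns 1 if an odd number of bits are set in w, else 0.
inline int parity8(std::uint8_t w) {
    int p = 0;
    for (int i = 0; i < 8; ++i) p ^= (w >> i) & 1;
    return p;
}

// Encode 4 data bits into an 8-bit codeword. Bit positions 1..7 form a
// Hamming(7,4) code (parity bits at positions 1, 2 and 4); bit 0 holds an
// overall parity bit so that the whole byte has even parity.
inline std::uint8_t ecc_encode(std::uint8_t data) {
    assert(data < 16);
    int d0 = data & 1, d1 = (data >> 1) & 1;
    int d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    int p1 = d0 ^ d1 ^ d3;  // covers positions 3, 5, 7
    int p2 = d0 ^ d2 ^ d3;  // covers positions 3, 6, 7
    int p4 = d1 ^ d2 ^ d3;  // covers positions 5, 6, 7
    std::uint8_t w = static_cast<std::uint8_t>(
        (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) |
        (d1 << 5) | (d2 << 6) | (d3 << 7));
    return static_cast<std::uint8_t>(w | parity8(w));
}

// Decode a codeword, repairing a single flipped bit if possible.
inline DecodeResult ecc_decode(std::uint8_t w) {
    // XOR of the positions of all set bits; zero for a valid codeword,
    // and equal to the flipped position after a single-bit error.
    int syndrome = 0;
    for (int pos = 1; pos <= 7; ++pos)
        if ((w >> pos) & 1) syndrome ^= pos;

    EccStatus status = EccStatus::Ok;
    if (parity8(w) == 1) {
        // Odd overall parity: treat as one flipped bit. A syndrome of 0
        // means the overall parity bit (bit 0) itself was hit.
        w = static_cast<std::uint8_t>(w ^ (1u << syndrome));
        status = EccStatus::Corrected;
    } else if (syndrome != 0) {
        // Even overall parity but a nonzero syndrome: two bits flipped.
        status = EccStatus::Uncorrectable;
    }
    std::uint8_t data = ((w >> 3) & 1) | (((w >> 5) & 1) << 1) |
                        (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3);
    return {status, data};
}
```

Flipping any one bit of ecc_encode(0xB) and decoding returns Corrected with the original payload, while flipping two bits returns Uncorrectable. Three or more flipped bits are beyond what this toy code can classify reliably.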

Marginal circuits of any kind may not fail 100% of the time, but they are more likely to fail under the stress of heavy use and the associated rise in temperature.

There are diagnostic utilities floating around that stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analogous tool for testing the device RAM banks of the GPU.

If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.
