CUDA racecheck,共享内存阵列和cudaDeviceSynchronize() [英] CUDA racecheck, shared memory array and cudaDeviceSynchronize()

查看:211
本文介绍了CUDA racecheck,共享内存阵列和cudaDeviceSynchronize()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近发现了 cuda-memcheck racecheck 工具,该工具可在CUDA 5.0中使用(cuda-memcheck --tool racecheck,请参见

I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.

在调试模式下,此工具未检测到任何东西,这显然是正常的.但是,在发布模式(-O3)中,根据问题的参数,我会收到错误消息.

In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.

这是一个错误示例(在第22行初始化共享内存,在第119行分配):

Here is an error example (initialization of shared memory on line 22, assignment on line 119):

==========错误:在块(35,0,0)的共享 0x0处检测到潜在的WAW危险: ==========在.... h:119:void kernel_test3(Data *)的0x00000890处写入线程(32,0,0) ==========在.... h:22:void kernel_test3(Data *)<0>处以0x00000048写入线程(0,0,0) ==========当前值:13,传入值:0

========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) : ========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*) ========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
========= Current Value : 13, Incoming Value : 0

  1. 让我惊讶的第一件事是线程ID.当我第一次遇到该错误时,每个块包含32个线程(标识0到31).那么,为什么线程ID 32会出现问题?我什至在threadIdx.x上添加了额外的检查,但这并没有改变.
  2. 我使用共享内存作为临时缓冲区,每个线程处理自己的多维数组参数,例如__shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK].我真的不明白怎么可能有任何竞争条件,因为每个线程都处理共享内存中自己的一部分.
  3. 将网格大小从64个块减少到32个块似乎可以解决此问题(每个块32个线程).我不明白为什么.
  1. The first thing that surprised me is the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31). So why is there a problem with the thread id 32? I even added an extra check on threadIdx.x, but this changed nothing.
  2. I use shared memory as a temporary buffer, and each thread deals with its own parameters of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]. I do not really understand how there could be any race conditions, since each thread deals with its own part of shared memory.
  3. Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.

为了了解正在发生的事情,我使用了一些更简单的内核进行了测试. 让我向您展示一个产生此类错误的内核示例.基本上,此内核使用共享内存的SIZE_X*SIZE_Y*NTHREADS*sizeof(float) B,每个SM可以使用48KB共享内存.

In order to understand what was happening, I tested with some simpler kernels. Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float) B of shared memory, and I can use 48KB of shared memory per SM.

test.cu

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
  const unsigned int NTHREADS = 32;

  //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
  kernel_test<NTHREADS><<<64, NTHREADS>>>();

  cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}

编译:

nvcc test.cu --ptxas-options=-v -o test

如果我们运行内核:

cuda-memcheck --tool racecheck test

  • kernel_test<32><<<32, 32>>>();:32个块,32个线程=>不会导致任何明显的比赛检查错误.
  • kernel_test<32><<<64, 32>>>();:64个块,32个线程=>会导致WAW危害(threadId.x = 32 ?!)和错误.
  • kernel_test<32><<<32, 32>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
  • kernel_test<32><<<64, 32>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadId.x = 32?!) and errors.

==========错误:在块(57,0,0)中的共享 0x6处检测到潜在的WAW危险:
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(0,0,0)
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(1、0、0)
==========当前值:0,传入值:128

========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128

==========信息:(正在写入相同的数据)在块(47,0,0)的共享 0x0处检测到潜在的WAW危险:
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(32,0,0)
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(0,0,0)
==========当前值:0,传入值:0

========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0

那么我在这里想念什么?我在共享内存上做错什么了吗? (我仍然是初学者)

So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)

问题似乎是由于NBLOCKS > 32引起的.为什么会这样?

The problem seems to be coming from cudaDeviceSynchronize() when NBLOCKS > 32. Why is this happening?

推荐答案

这显然是NVIDIA Linux驱动程序中的错误.该错误在313.18版本发布后消失了.

This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.

这篇关于CUDA racecheck,共享内存阵列和cudaDeviceSynchronize()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆