CUDA racecheck,共享内存阵列和cudaDeviceSynchronize() [英] CUDA racecheck, shared memory array and cudaDeviceSynchronize()
问题描述
我最近发现了 cuda-memcheck 的 racecheck 工具,该工具可在CUDA 5.0中使用(cuda-memcheck --tool racecheck
,请参见
I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck
, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.
在调试模式下,此工具未检测到任何东西,这显然是正常的.但是,在发布模式(-O3)中,根据问题的参数,我会收到错误消息.
In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
这是一个错误示例(在第22行初始化共享内存,在第119行分配):
Here is an error example (initialization of shared memory on line 22, assignment on line 119):
==========错误:在块(35,0,0)的共享 0x0处检测到潜在的WAW危险: ==========在.... h:119:void kernel_test3(Data *)的0x00000890处写入线程(32,0,0) ==========在.... h:22:void kernel_test3(Data *)<0>处以0x00000048写入线程(0,0,0) ==========当前值:13,传入值:0
========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) : ========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*) ========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
========= Current Value : 13, Incoming Value : 0
- 让我惊讶的第一件事是线程ID.当我第一次遇到该错误时,每个块包含32个线程(标识0到31).那么,为什么线程ID 32会出现问题?我什至在
threadIdx.x
上添加了额外的检查,但这并没有改变.
我使用共享内存作为临时缓冲区,每个线程处理自己的多维数组参数,例如 - 将网格大小从64个块减少到32个块似乎可以解决此问题(每个块32个线程).我不明白为什么.
__shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]
.我真的不明白怎么可能有任何竞争条件,因为每个线程都处理共享内存中自己的一部分.
- The first thing that surprised me is the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31). So why is there a problem with the thread id 32? I even added an extra check on
threadIdx.x
, but this changed nothing. - I use shared memory as a temporary buffer, and each thread deals with its own parameters of a multidimensional array, e.g.
__shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]
. I do not really understand how there could be any race conditions, since each thread deals with its own part of shared memory. - Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.
为了了解正在发生的事情,我使用了一些更简单的内核进行了测试.
让我向您展示一个产生此类错误的内核示例.基本上,此内核使用共享内存的SIZE_X*SIZE_Y*NTHREADS*sizeof(float)
B,每个SM可以使用48KB共享内存.
In order to understand what was happening, I tested with some simpler kernels.
Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float)
B of shared memory, and I can use 48KB of shared memory per SM.
test.cu
template <unsigned int NTHREADS>
__global__ void kernel_test()
{
const int SIZE_X = 4;
const int SIZE_Y = 4;
__shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];
for (unsigned int i = 0; i < SIZE_X; i++)
for (unsigned int j = 0; j < SIZE_Y; j++)
tmp[i][j][threadIdx.x] = threadIdx.x;
}
int main()
{
const unsigned int NTHREADS = 32;
//kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
kernel_test<NTHREADS><<<64, NTHREADS>>>();
cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
编译:
nvcc test.cu --ptxas-options=-v -o test
如果我们运行内核:
cuda-memcheck --tool racecheck test
-
kernel_test<32><<<32, 32>>>();
:32个块,32个线程=>不会导致任何明显的比赛检查错误. -
kernel_test<32><<<64, 32>>>();
:64个块,32个线程=>会导致WAW危害(threadId.x = 32 ?!)和错误.
kernel_test<32><<<32, 32>>>();
: 32 blocks, 32 threads => does not lead to any apparent racecheck error.kernel_test<32><<<64, 32>>>();
: 64 blocks, 32 threads => leads to WAW hazards (threadId.x = 32?!) and errors.
==========错误:在块(57,0,0)中的共享 0x6处检测到潜在的WAW危险:
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(0,0,0)
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(1、0、0)
==========当前值:0,传入值:128
========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128
==========信息:(正在写入相同的数据)在块(47,0,0)的共享 0x0处检测到潜在的WAW危险:
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(32,0,0)
==========在.... h:403:void kernel_test(void)的0x00000048处写入线程(0,0,0)
==========当前值:0,传入值:0
========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0
那么我在这里想念什么?我在共享内存上做错什么了吗? (我仍然是初学者)
So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)
问题似乎是由于NBLOCKS > 32
引起的.为什么会这样?
The problem seems to be coming from cudaDeviceSynchronize()
when NBLOCKS > 32
. Why is this happening?
推荐答案
这显然是NVIDIA Linux驱动程序中的错误.该错误在313.18版本发布后消失了.
This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.
这篇关于CUDA racecheck,共享内存阵列和cudaDeviceSynchronize()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!