Race Condition in CUDA programs
Problem Description
I have two pieces of code: one written in C, and the corresponding operation written in CUDA. Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
The C program:
{
    for (i = 1; i < 10000; i++)
    {
        t = a[i] + b[i];
        a[i-1] = t;
    }
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;
    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid-1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel dimensions:
dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and a[i-1] is written by the same thread. The same happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 spread across 313 blocks, how does the dependency get respected?
Answer
As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
Given the nature of problem, which has dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.
Just imagine thread 2 reaches the if statement; since it matches the condition, it enters the body. Now that thread does the following:
private_t=a[2]+b[2];
a[1]=private_t;
which is equivalent to:
a[1]=a[2]+b[2];
As you pointed out, there is a data dependency on array a. Since you can't control the order of execution of the warps, at some point you'll be reading an already-updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:
if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];
    __syncthreads();
    a[tid-1] = private_t;
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}
In this way, every thread gets its own version of the private_t variable using the original array a; then the array is updated in parallel.
About the value of *t:
If you're only looking at the value of *t, you'll not notice the effect of this random scheduling, whatever the launch parameters; that's because the thread with tid==9999 could be in the last warp along with the thread with tid==9998. Since the two array positions needed to create the private_t value are read before any conflicting write, and you already had that synchronization barrier, the answer should be right.