Race Condition in CUDA programs

Question

I have two pieces of code: one written in C and the corresponding operation written in CUDA. Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.

The C program:

{
    /* Serial loop: at iteration i, a[i] has not been overwritten yet,
       so a[i-1] always receives the original a[i] plus b[i]. */
    for (i = 1; i < 10000; i++)
    {
        t = a[i] + b[i];
        a[i - 1] = t;
    }
}

The equivalent CUDA program:

__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;   // global thread index
    int private_t;
    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid - 1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}

The kernel dimensions:

dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>

The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread. Now the same happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks, how does the dependency get respected?

Answer

As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.

You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
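For instance, here is a minimal sketch of that pattern (the kernel, BLOCK_SIZE, and buffer names are illustrative, not taken from the question, and it assumes a launch with BLOCK_SIZE threads per block): every thread stages its element in shared memory, and the barrier guarantees the whole block has finished writing before any thread reads a neighbour's value.

// Illustrative only: stage data in shared memory, then read a neighbour's
// element. __syncthreads() guarantees the whole block has written 'stage'
// before any thread of that block reads it; it says nothing about other blocks.
#define BLOCK_SIZE 256

__global__ void shift_within_block(const int *in, int *out, int n)
{
    __shared__ int stage[BLOCK_SIZE];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        stage[threadIdx.x] = in[tid];

    __syncthreads();   // block-wide barrier: 'stage' is fully populated here

    // Safe only within the block; the last thread of a block cannot read
    // the first element of the next block this way.
    if (tid + 1 < n && threadIdx.x + 1 < blockDim.x)
        out[tid] = stage[threadIdx.x + 1];
}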

Given the nature of problem, which has dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.

Just imagine thread 2 reaching the if statement: since it matches the condition, it enters the body. That thread then does the following:

private_t=a[2]+b[2];
a[1]=private_t;

Which is equivalent to:

a[1]=a[2]+b[2];

As you pointed out, there is a data dependency on array a. Since you can't control the order of execution of the warps, at some point you'll be using an updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:

if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];
    __syncthreads();
    a[tid - 1] = private_t;
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}

In this way, every thread gets its own version of the private_t variable using the original array a, and then the array is updated in parallel.
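If it helps to see it end to end, below is a host-side harness sketch; the harness itself (buffer names, random input, the printf) is my assumption and not part of the original post. It runs the serial C loop and the corrected kernel0 on the same input, using the launch configuration from the question, and prints both values of t.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical harness: assumes the corrected kernel0 above is compiled
// in the same .cu file.
int main(void)
{
    const int N = 10000;
    int *a = (int *)malloc(N * sizeof(int));
    int *b = (int *)malloc(N * sizeof(int));
    int *a_ref = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) {
        a[i] = rand() % 100;
        b[i] = rand() % 100;
        a_ref[i] = a[i];              // copy for the serial reference
    }

    // Serial reference: the original C loop.
    int t_ref = 0;
    for (int i = 1; i < N; i++) {
        t_ref = a_ref[i] + b[i];
        a_ref[i - 1] = t_ref;
    }

    // Device buffers and the launch configuration from the question.
    int *d_a, *d_b, *d_t;
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMalloc((void **)&d_b, N * sizeof(int));
    cudaMalloc((void **)&d_t, sizeof(int));
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    dim3 k0_dimBlock(32);
    dim3 k0_dimGrid(313);             // 313 * 32 = 10016 threads >= N
    kernel0<<<k0_dimGrid, k0_dimBlock>>>(d_b, d_a, d_t, N);
    cudaDeviceSynchronize();

    int t_gpu = 0;
    cudaMemcpy(&t_gpu, d_t, sizeof(int), cudaMemcpyDeviceToHost);
    printf("serial t = %d, gpu *t = %d\n", t_ref, t_gpu);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_t);
    free(a); free(b); free(a_ref);
    return 0;
}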

Regarding the value of *t:

If you're only looking at the value of *t, you won't notice the effect of this random scheduling. That's because, depending on the launch parameters, the thread with tid==9999 could be in the last warp along with the thread tid==9998. Since the two array positions needed to create the private_t value are read by that thread, and you already have that synchronization barrier, the answer should be right.
