Struggling with intuition regarding how warp-synchronous thread execution works


Problem Description

I am new to CUDA. I am working on basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:

__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement, any example please?
        wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
        if ( tid == 0 ) {
            volatile int *wsSum = sPartials;// why this statement is needed?
            out[blockIdx.x] = wsSum[0];
        }
    }
}

Unfortunately, it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread IDs showing how the statements are executed? I think it is important to understand these concepts, so any help would be appreciated!

Answer

Let's look at the code in blocks, and answer your questions along the way:

int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}

The above code travels through a data set of size N. An assumption we can make for understanding purposes is that N > blockDim.x*gridDim.x, this last term simply being the total number of threads in the grid. Since N is larger than the total number of threads, each thread sums multiple elements from the data set. From the standpoint of a given thread, it sums elements that are spaced apart by the total number of threads in the grid (blockDim.x*gridDim.x). Each thread stores its sum in a local (presumably register) variable named sum.
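
To make those numbers concrete, here is a minimal host-side launch sketch. The block and grid sizes (128 and 64) and the wrapper function name are my own assumptions for illustration, not something from the original post:

#include <cuda_runtime.h>

// Hypothetical launch wrapper for the kernel above -- sizes chosen only for illustration.
void launch_reduction( int *d_out, const int *d_in, size_t N )
{
    const int threads = 128;   // assumed block size, a multiple of 32
    const int blocks  = 64;    // assumed grid size
    // sPartials[] is declared "extern __shared__", so its size
    // (one int per thread) is supplied as the third launch parameter.
    Reduction2_kernel<<<blocks, threads, threads*sizeof(int)>>>( d_out, d_in, N );
    // With these sizes, thread 0 of block 0 sums in[0], in[8192], in[16384], ...
    // -- a stride of blockDim.x*gridDim.x = 128*64 = 8192 elements -- until the
    // index reaches N.
}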

sPartials[tid] = sum;
__syncthreads();

As each thread finishes (i.e., as its for-loop index exceeds N) it stores its intermediate sum in shared memory, and then waits for all other threads in the block to finish.

for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();
}

So far we haven't talked about the dimension of the block - it hasn't mattered. Let's assume each block has some integer multiple of 32 threads. The next step will be to start gathering the various intermediate sums stored in shared memory into smaller and smaller groups of variables. The above code starts out by selecting half of the threads in the threadblock (blockDim.x>>1) and uses each of those threads to combine two of the partial sums in shared memory. So if our threadblock started out at 128 threads, we just used 64 of those threads to reduce 128 partial sums into 64 partial sums. This process repeats in the for-loop, each time cutting the number of threads in half and combining partial sums, two at a time per thread. The process continues as long as activeThreads > 32. So if activeThreads is 64, then those 64 threads will combine 128 partial sums into 64 partial sums. But when activeThreads becomes 32, the for-loop is terminated, without combining 64 partial sums into 32. So at the completion of this block of code, we have taken the (arbitrary multiple of 32 threads) threadblock, and reduced however many partial sums we started out with down to 64. This process of combining, say, 256 partial sums to 128 partial sums to 64 partial sums must wait at each iteration for all threads (in multiple warps) to complete their work, so the __syncthreads(); statement is executed with each pass of the for-loop.
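
If it helps to see that schedule spelled out, here is a small host-side trace of the loop (ordinary C++, not kernel code), assuming a hypothetical block size of 256 threads:

#include <cstdio>

int main()
{
    const int blockDimX = 256;                   // assumed block size, for illustration only
    for ( int activeThreads = blockDimX >> 1;    // starts at 128
              activeThreads > 32;
              activeThreads >>= 1 ) {
        printf( "%3d threads each add sPartials[tid + %3d] into sPartials[tid]\n",
                activeThreads, activeThreads );
    }
    // Prints two lines: a 128-thread pass (256 -> 128 partial sums) and a
    // 64-thread pass (128 -> 64 partial sums).  The loop then stops at
    // activeThreads == 32, leaving 64 partial sums for the warp-synchronous code.
    return 0;
}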

Keep in mind, at this point, we have reduced our threadblock to 64 partial sums.

if ( threadIdx.x < 32 ) {

For the remainder of the kernel after this point, we will only be using the first 32 threads (i.e. the first warp). All other threads will remain idle. Note that there is no __syncthreads(); after this point either, as that would be a violation of the rule for using it (all threads must participate in a __syncthreads();).

    volatile int *wsSum = sPartials;

We are now creating a volatile pointer to shared memory. In theory, this tells the compiler that it should not perform certain optimizations, such as keeping a particular value in a register. Why didn't we need this before? Because __syncthreads(); also carries with it a memory-fencing function. A __syncthreads(); call, in addition to causing all threads to wait at the barrier for each other, also forces all thread updates back into shared or global memory. We can no longer depend on this feature, because from here on out we will not be using __syncthreads(); we have restricted ourselves -- for the remainder of the kernel -- to a single warp.
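
To see what the volatile qualifier is guarding against, here is a rough sketch of the kind of transformation the compiler would otherwise be free to make (an illustration of the hazard, not actual generated code):

// Without volatile, the compiler may legally rewrite
//
//     sPartials[tid] += sPartials[tid + 16];
//     sPartials[tid] += sPartials[tid + 8];
//
// into something like:
//
//     int r = sPartials[tid];      // value cached in a register
//     r += sPartials[tid + 16];
//     r += sPartials[tid + 8];     // may read a value another thread has not yet written back
//     sPartials[tid] = r;          // intermediate stores deferred or eliminated
//
// Other threads in the warp would then never see the intermediate results they
// depend on.  The volatile pointer forces every read and write of wsSum[] to go
// to shared memory at each step.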

    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this

The previous reduction block left us with 64 partial sums. But we have at this point restricted ourselves to 32 threads. So we must do one more combination to gather the 64 partial sums into 32 partial sums, before we can proceed with the remainder of the reduction.

    wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?

Now we are finally getting into some warp-synchronous programming. This line of code depends on the fact that 32 threads are executing in lockstep. To understand why (and how it works at all) it will be convenient to break this down into the sequence of operations needed to complete this line of code. It looks something like:

    read the partial sum of my thread into a register
    read the partial sum of the thread that is 16 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread
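
Written out as separate statements inside the kernel, that micro-sequence would look like the sketch below (the variable names are mine, and this is a description of the behaviour, not the actual compiled code):

int myVal    = wsSum[tid];         // 1. read my partial sum into a register
int otherVal = wsSum[tid + 16];    // 2. read the partial sum 16 above mine
int newVal   = myVal + otherVal;   // 3. add the two partial sums
wsSum[tid]   = newVal;             // 4. store the result back to my location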

All 32 threads will follow the above sequence in lock-step. All 32 threads will begin by reading wsSum[tid] into a (thread-local) register. That means thread 0 reads wsSum[0], thread 1 reads wsSum[1], etc. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], etc. It's true that we don't care about the wsSum[32] (and higher) values; we've already collapsed those into the first 32 wsSum[] values. However, as we'll see, only the first 16 threads (at this step) will contribute to the final result, so the first 16 threads will be combining the 32 partial sums into 16. The next 16 threads will be acting as well, but they are just doing garbage work -- it will be ignored.

The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code:

    wsSum[tid] += wsSum[tid + 8];

repeats this process with a granularity of 8. Again, all 32 threads are active, and the micro-sequence is something like this:

    read the partial sum of my thread into a register
    read the partial sum of the thread that is 8 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread

So the first 8 threads combine the first 16 partial sums (wsSum[0..15]) into 8 partial sums (contained in wsSum[0..7]). The next 8 threads are also combining wsSum[8..23] into wsSum[8..15], but the writes to locations 8..15 occur after those values were read by threads 0..7, so the valid data is not corrupted. It's just extra junk work going on. Likewise for the other blocks of 8 threads within the warp. So at this point we have combined the partial sums of interest into 8 locations.

    wsSum[tid] += wsSum[tid + 4];  //this combines partial sums of interest into 4 locations
    wsSum[tid] += wsSum[tid + 2];  //this combines partial sums of interest into 2 locations
    wsSum[tid] += wsSum[tid + 1];  //this combines partial sums of interest into 1 location

These lines of code follow a similar pattern to the previous two, partitioning the warp into 8 groups of 4 threads (only the first 4-thread group contributes to the final result), then into 16 groups of 2 threads, with only the first 2-thread group contributing to the final result, and finally into 32 groups of 1 thread each, each thread generating a partial sum, with only the first partial sum being of interest.
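
Here is a small host-side simulation of the whole warp-synchronous tail (ordinary C++, not kernel code), assuming for illustration that all 64 surviving partial sums equal 1. Because the 32 threads of a warp execute in lockstep, each step is modeled as "all reads happen, then all writes happen":

#include <cstdio>

int main()
{
    int ws[64];
    for ( int i = 0; i < 64; i++ ) ws[i] = 1;                 // 64 partial sums, all assumed to be 1

    const int offsets[] = { 32, 16, 8, 4, 2, 1 };
    for ( int off : offsets ) {
        int snapshot[64];
        for ( int i = 0; i < 64; i++ ) snapshot[i] = ws[i];   // lockstep: all reads first
        for ( int tid = 0; tid < 32; tid++ )                  // then 32 lockstep writes
            ws[tid] = snapshot[tid] + snapshot[tid + off];
        printf( "after +%2d: ws[0] = %2d\n", off, ws[0] );
    }
    // Output: 2, 4, 8, 16, 32, 64.  Only ws[0] holds the block total at the end;
    // everything the other threads computed along the way is the junk work
    // described above.
    return 0;
}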

    if ( tid == 0 ) {
        volatile int *wsSum = sPartials;// why this statement is needed?
        out[blockIdx.x] = wsSum[0];
    }

At last, in the previous step, we had reduced all partial sums down to a single value. It's now time to write that single value out to global memory. Are we done with the reduction? Perhaps, but probably not. If the above kernel were launched with only 1 threadblock, then we would be done -- our final "partial" sum is in fact the sum of all elements in the data set. But if we launched multiple blocks, then the final result from each block is still a "partial" sum, and the results from all blocks must be added together (somehow).
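
One common way to finish the job (sketched below, under the same assumed block size as before, and not part of the original code) is to run the kernel a second time over the per-block outputs with a single block, so that its one "partial" sum is the final total:

#include <cuda_runtime.h>

// Hypothetical second stage: d_blockSums holds one partial sum per block from
// the first launch, and d_final receives the overall total in d_final[0].
void finish_reduction( int *d_final, const int *d_blockSums, int numBlocks )
{
    const int threads = 128;   // assumed block size, a multiple of 32
    Reduction2_kernel<<<1, threads, threads*sizeof(int)>>>( d_final, d_blockSums,
                                                            (size_t)numBlocks );
}

Alternatively, the per-block sums could simply be copied back with cudaMemcpy and added up on the host.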


As for the second volatile int *wsSum = sPartials; declaration inside if ( tid == 0 ): I don't know why that statement is needed.

My guess is that it was left over from a previous iteration of the reduction kernel, and the programmer forgot to delete it, or didn't notice that it wasn't needed. Perhaps someone else will know the answer to this one.

Finally, the CUDA reduction sample provides very good reference code for study, and the accompanying PDF document does a good job of describing the optimizations that can be made along the way.
