CUDA: __syncthreads() inside if statements

Problem Description

I have a question about CUDA synchronization. In particular, I need some clarification about synchronization inside if statements. I mean, if I put a __syncthreads() inside the scope of an if statement that is hit by only a fraction of the threads in the block, what happens? I thought that some threads would remain "forever" waiting for the other threads that would never hit the synchronization point. So, I wrote and executed some sample code to check:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {   
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();

        index += gridSize;
    }
}

Surprisingly enough, I observed that the output was a pretty "normal" one (64 elements, block size 32):

100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

So I slightly modified my code in the following way:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {   
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();
        vett[index] = 3;
        __syncthreads();

        index += gridSize;
    }
}

And the output was:

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 

Again, I was wrong: I thought that the threads inside the if statement, after modifying their elements of the vector, would remain in a wait state and never get out of the if scope. So... could you please clarify what happened? Does a thread that gets past a synchronization point unblock the threads waiting at the barrier? If you need to reproduce my situation, I used CUDA Toolkit 5.0 RC with SDK 4.2. Thanks a lot in advance.

Solution

In short, the behavior is undefined. So it may sometimes do what you want, or it may not, or (quite likely) it will just hang or crash your kernel.

If you are really curious about how things work internally, you need to remember that threads do not execute independently, but one warp (a group of 32 threads) at a time.

This of course creates a problem with conditional branches where the condition does not evaluate uniformly throughout the warp. The problem is solved by executing both paths, one after the other, each time with the threads that are not supposed to execute that path disabled. IIRC on existing hardware the taken path is executed first and the not-taken path afterwards, but this behavior is undefined and thus not guaranteed.

This separate execution of the paths continues up to a point that the compiler can determine is guaranteed to be reached by all threads of the two separate execution paths (the "reconvergence point" or "synchronization point"). When execution of the first code path reaches this point, it is stopped and the second code path is executed instead. When the second path reaches the synchronization point, all threads are enabled again and execution continues uniformly from there.
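
To make this concrete, here is a minimal sketch for illustration (not part of the original question or answer): the condition splits a single warp into two halves, so the two assignments are executed one after the other with the inactive lanes masked off, and the warp reconverges at the first statement after the if/else.

__global__ void divergent(int* out)
{
    int lane = threadIdx.x % 32;      // lane index within the warp

    if (lane < 16)
        out[threadIdx.x] = 1;         // one pass: lanes 0-15 active
    else
        out[threadIdx.x] = 2;         // other pass: lanes 16-31 active
                                      // (which pass runs first is undefined)

    out[threadIdx.x] += 10;           // reconvergence point: all lanes active again
}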

The situation gets more complicated if another conditional branch is encountered before the synchronization. This problem is solved with a stack of paths that still need to be executed (luckily the growth of the stack is limited as we can have at most 32 different code paths for one warp).
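
Again purely as an illustration (my own sketch, not from the answer): with nested conditions each split can subdivide the active lanes further, which is exactly what the stack of pending paths has to keep track of.

__global__ void nested(int* out)
{
    int lane = threadIdx.x % 32;

    if (lane < 16)                    // first split: lanes 0-15 vs 16-31
    {
        if (lane < 8)                 // second split inside the first path
            out[threadIdx.x] = 1;
        else
            out[threadIdx.x] = 2;
    }
    else
    {
        out[threadIdx.x] = 3;
    }
    // after both levels have reconverged, all 32 lanes are active again
}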

Where the synchronization points are inserted is undefined and even varies slightly between architectures, so again there are no guarantees. The only (unofficial) comment you will get from Nvidia is that the compiler is pretty good at finding optimal synchronization points. However there are often subtle issues that may move the optimal point further down than you might expect, particularly if threads exit early.

Now, to understand the behavior of the __syncthreads() directive (which translates into a bar.sync instruction in PTX), it is important to realize that this instruction is not executed per thread but for the whole warp at once (regardless of whether any threads are disabled), because only the warps of a block need to be synchronized with each other. The threads of a warp already execute in sync, so further synchronization will either have no effect (if all threads are enabled) or lead to a deadlock when trying to sync threads from different conditional code paths.

You can work your way from this description to how your particular piece of code behaves. But keep in mind that all this is undefined, there are no guarantees, and relying on a specific behavior may break your code at any time.
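
As a practical takeaway, here is a hedged sketch of how the kernel from the question could be restructured so that no barrier is ever placed on a divergent path (assuming the goal is only the per-element writes shown above; the barrier is not actually needed for these independent writes and is kept just to show a legal placement). The divergent work stays inside the if, the __syncthreads() is moved outside it, and the loop bound is made uniform across the block so that every thread executes the same number of iterations and therefore hits the same barriers.

__global__ void kernel_uniform_sync(float* vett, int n)
{
    int gridSize = blockDim.x * gridDim.x;

    // The loop condition depends only on blockIdx.x, so all threads of a
    // block run the same number of iterations.
    for (int base = blockIdx.x * blockDim.x; base < n; base += gridSize)
    {
        int index = base + threadIdx.x;

        if (index < n)                // guard only the memory access
            vett[index] = (threadIdx.x < 10) ? 100.0f : 2.0f;

        __syncthreads();              // uniform: reached by every thread of the block
    }
}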

You may want to look at the PTX manual for some more details, particularly for the bar.sync instruction that __syncthreads() compiles to. Henry Wong's "Demystifying GPU Microarchitecture through Microbenchmarking" paper, referenced by ahmad in another answer, is also well worth reading. Even though the architecture and CUDA version it covers are outdated by now, the sections about conditional branching and __syncthreads() appear to still be generally valid.
