CUDA: __syncthreads() inside if statements


Question

I have a question about CUDA synchronizing. Particularly, I need some clarification about synchronizing in if statements. I mean, if I put a __syncthreads() under the scope of an if statement hit by a fraction of the threads inside the block, what happens? I thought that some threads will remain "forever" waiting for the other threads that won't hit the synchronizing point. So, I wrote and executed some sample code to inspect:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {   
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();

        index += gridSize;
    }
}

Surprisingly enough, I observed that the output was pretty "normal" (64 elements, block size 32):

100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

So I modified my code slightly in the following way:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {   
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();
        vett[index] = 3;
        __syncthreads();

        index += gridSize;
    }
}

The output was:

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 

Again, I was wrong: I thought that the threads inside the if statement, after modifying the element of the vector, would remain in a wait state and never get out of the if scope. So... could you please clarify what happened? Does a thread that gets past a synchronization point unblock the threads waiting at the barrier? If you need to reproduce my situation: I used CUDA Toolkit 5.0 RC with SDK 4.2. Thanks a lot in advance.

Answer

In short, the behavior is undefined. It may sometimes do what you want, or it may not, or (quite likely) it will simply hang or crash your kernel.

If you are really curious how things work internally, you need to remember that threads do not execute independently, but one warp (a group of 32 threads) at a time.

This of course creates a problem with conditional branches where the condition does not evaluate uniformly throughout the warp. The problem is solved by executing both paths, one after the other, each with those threads disabled that are not supposed to execute that path. IIRC on existing hardware the taken branch is executed first, then the path where the branch is not taken, but this behavior is undefined and thus not guaranteed.
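As a minimal sketch of this serialization (the kernel name and condition are made up for illustration; the execution order shown in the comments is the typical but not guaranteed behavior described above):

```cuda
// Sketch of warp divergence: within each 32-thread warp, the condition
// below splits the warp in half. The hardware serializes the two sides:
// one half runs its path while the other half's threads are masked off,
// then the roles swap, and the warp reconverges after the if/else.
__global__ void divergent(int *out)
{
    if (threadIdx.x % 32 < 16)
        out[threadIdx.x] = 1;   // first half-warp, executed in one pass
    else
        out[threadIdx.x] = 2;   // second half-warp, a separate pass
}
```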

This separate execution of paths continues up to some point that the compiler can determine is guaranteed to be reached by all threads of the two separate execution paths (the "reconvergence point" or "synchronization point"). When execution of the first code path reaches this point, it is suspended and the second code path is executed instead. When the second path reaches the synchronization point, all threads are enabled again and execution continues uniformly from there.

The situation gets more complicated if another conditional branch is encountered before the synchronization point. This is solved with a stack of paths that still need to be executed (luckily the growth of the stack is bounded, as there can be at most 32 different code paths for one warp).
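A hypothetical kernel showing how such pending paths pile up (names and values are made up for illustration):

```cuda
// Sketch of nested divergence: the outer if splits the warp, and before
// those paths reconverge the inner if splits one side again. The hardware
// tracks the still-pending paths on a small stack; since a warp has only
// 32 threads, at most 32 distinct paths can ever be pending at once.
__global__ void nested(int *out)
{
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        if (lane < 8)
            out[threadIdx.x] = 1;   // one pending path
        else
            out[threadIdx.x] = 2;   // another pending path
    } else {
        out[threadIdx.x] = 3;       // path from the outer branch
    }
}
```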

Where the synchronization points are inserted is undefined and even varies slightly between architectures, so again there are no guarantees. The only (unofficial) comment you will get from Nvidia is that the compiler is pretty good at finding optimal synchronization points. However, there are often subtle issues that may move the optimal point further down than you might expect, particularly if threads exit early.
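The early-exit hazard mentioned above can be illustrated with a hypothetical kernel (not from the question):

```cuda
// Sketch of the early-exit hazard: if n is not a multiple of the block
// size, some threads of the last block return before the barrier, while
// the remaining threads of that block wait at __syncthreads() for
// partners that will never arrive -- the barrier is divergent, which is
// undefined behavior.
__global__ void early_exit_hazard(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;            // part of the block may leave here...
    v[i] = 1.0f;
    __syncthreads();       // ...making this barrier divergent
}
```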

Now to understand the behavior of __syncthreads() (which translates into a bar.sync instruction in PTX), it is important to realize that this instruction is not executed per thread, but for the whole warp at once (regardless of whether any threads are disabled), because only the warps of a block need to be synchronized with each other. The threads of a warp already execute in lockstep, so further synchronization within a warp either has no effect (if all threads are enabled) or leads to a deadlock when trying to sync threads from different conditional code paths.

You can work your way from this description to how your particular piece of code behaves. But keep in mind that all of this is undefined, there are no guarantees, and relying on a specific behavior may break your code at any time.
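One way to make the kernel from the question well-defined (a sketch, assuming the intent is simply "all threads write, and the first 10 threads of each block overwrite") is to keep every __syncthreads() on a path that all threads of the block reach together, including the loop condition itself:

```cuda
// Sketch of a barrier-safe version of the kernel from the question: the
// grid-stride loop condition is made uniform across the block (base is
// the same for every thread of a block), and __syncthreads() is hoisted
// out of the divergent if, so every thread of the block reaches each
// barrier the same number of times.
__global__ void kernel_safe(float *vett, int n)
{
    int gridSize = blockDim.x * gridDim.x;

    for (int base = blockIdx.x * blockDim.x; base < n; base += gridSize)
    {
        int index = base + threadIdx.x;

        if (index < n)
            vett[index] = 2;
        if (index < n && threadIdx.x < 10)
            vett[index] = 100;   // no barrier inside the divergent path

        __syncthreads();         // reached uniformly by the whole block
    }
}
```

Whether a barrier is even needed here depends on what the loop body does; in the original example nothing written by one thread is read back by another, so the barriers could also simply be dropped.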

You may want to look at the PTX manual for more details, particularly on the bar.sync instruction that __syncthreads() compiles to. Henry Wong's paper "Demystifying GPU Microarchitecture through Microbenchmarking", referenced below by ahmad, is also well worth reading. Even though the architecture and CUDA version it covers are now outdated, the sections about conditional branching and __syncthreads() appear to still be generally valid.
