CUDA: how to use barrier.sync


Question

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar, which details the PTX synchronization instructions.


  1. It says there are 16 "barrier logical resources", and you can specify which barrier to use with the parameter "a". What is a barrier logical resource?

I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" and what "memory" does. I assume "name" replaces "%0" and "numThreads" replaces "%1", but what is "memory" and what are the colons doing?

__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
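For context while reading the answer below: in GCC-style extended asm, the colon-separated sections after the template string are output operands, input operands, and clobbers. A commented sketch of the same wrapper (behavior unchanged; the name `namedBarrierSyncAnnotated` is made up for illustration):

```cuda
__device__ __forceinline__ void namedBarrierSyncAnnotated(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;"       // PTX template; %0, %1 are operand placeholders
                 :                            // outputs: none
                 : "r"(name), "r"(numThreads) // inputs: "r" = 32-bit integer register
                 : "memory");                 // clobber: tells the compiler this statement
                                              // may read/write memory, so loads and stores
                                              // must not be reordered across the barrier
}
```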


  • In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the barrier.sync function? (For example, say I have a grid of 1 block of 256 threads. We split the block into 3 conditional branches s.t. threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel2, and threads 128 ~ 255 go into kernel3. I want the threads in kernel2 to synchronize only among themselves. So if I use the "namedBarrierSync" function defined above as "namedBarrierSync(1, 64)", does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?)

    I have tested with the code below (assume that gpuAssert is an error-checking function defined somewhere in the file).

    Here is the code:

    __global__ void test(int num_threads) 
    {
        if (threadIdx.x >= 64 && threadIdx.x < 128) 
        {
            namedBarrierSync(0, num_threads) ;
        }
        __syncthreads();
    }
    
    int main(void) 
    {
        test<<<1, 1, 256>>>(128);
        gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
        printf("complete\n");
        return 1;
    }
    


    Answer



    1. "Barrier logical resources" are the hardware necessary to synchronize threads/warps within a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them; it is sufficient to know that 16 instances of them are available.
    2. As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html.
    3. barrier.sync with a named barrier and a thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
    4. Your test launches only a single thread (with 256 bytes of shared memory allocated to it), which makes the test of the synchronization instruction moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead.
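Putting these points together, a minimal self-contained sketch of a working test might look as follows. Two assumptions go beyond the answer above: the sketch uses barrier 1 instead of 0, since __syncthreads() itself compiles to bar.sync 0, and it passes a thread count of 64, the number of threads that actually reach the named barrier (per point 3, that is the count the barrier waits for). gpuAssert here is a stand-in for the asker's error checker.

```cuda
#include <cstdio>
#include <cstdlib>

// Stand-in for the asker's gpuAssert error-checking function.
static void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}

__global__ void test(int num_threads) {
    if (threadIdx.x >= 64 && threadIdx.x < 128) {
        // Only these 64 threads reach the named barrier, so the barrier's
        // thread count must be 64, or the kernel would wait forever.
        namedBarrierSync(1, num_threads);
    }
    __syncthreads();
}

int main(void) {
    test<<<1, 256>>>(64);  // 1 block of 256 threads; no dynamic shared memory needed
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 0;
}
```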

