CUDA: how to use barrier.sync

Question

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar, which describes the PTX synchronization and communication instructions.

  1. It says there are 16 "barrier logical resources", and you can specify which barrier to use with the parameter "a". What is a barrier logical resource?

  2. I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" and what "memory" does. I assume "name" replaces "%0" and "numThreads" replaces "%1", but what is "memory" and what are the colons doing?

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
    }
    

  3. In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the bar.sync instruction? (For example, say I have a grid of 1 block and a block of 256 threads. We split the block into 3 conditional branches such that threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel2, and threads 128 ~ 255 go into kernel3. I want the threads in kernel2 to synchronize only among themselves. So if I use the "namedBarrierSync" function defined above as "namedBarrierSync(1, 64)", does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?)

  4. I have tested with the code below (assume that gpuAssert is an error-checking function defined elsewhere in the file).

Here is the code:

__global__ void test(int num_threads) 
{
    if (threadIdx.x >= 64 && threadIdx.x < 128) 
    {
        namedBarrierSync(0, num_threads) ;
    }
    __syncthreads();
}

int main(void) 
{
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 1;
}

Solution

  1. "barrier logical resource" are the hardware necessary to synchronize threads/warps in a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them, it is sufficient to know there are 16 instances of them available.
  2. As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html.
  3. barrier.sync with a named barrier and thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
  4. Your test only launches a single thread (with 256 bytes of shared memory allocated to it), which makes tests of synchronisation instructions moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead (see the sketch after this list, which combines points 3 and 4).
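
As a minimal sketch of how points 3 and 4 fit together: the program below launches one block of 256 threads and lets only threads 64 ~ 127 wait on a named barrier before the block-wide __syncthreads(). It reuses the namedBarrierSync wrapper from the question and a stand-in for the gpuAssert helper; the choice of barrier name 1 and a participant count of 64 are illustrative assumptions, not taken from the original answer.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the gpuAssert error-checking helper mentioned in the question.
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Wrapper around the PTX named-barrier instruction (same as in the question).
// The empty first colon section means there are no output operands; the second
// section binds the "r" (register) input operands substituted for %0 and %1;
// the final "memory" clobber tells the compiler that memory may be touched,
// so loads/stores are not reordered across the barrier.
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads)
{
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}

__global__ void test(int num_threads)
{
    // Only threads 64 ~ 127 (two warps, 64 threads) take part in named barrier 1.
    // The count passed to bar.sync is the number of threads expected at that
    // barrier, so it is 64 here; it must be a multiple of the warp size.
    if (threadIdx.x >= 64 && threadIdx.x < 128)
    {
        namedBarrierSync(1, num_threads);
    }
    // __syncthreads() maps to barrier 0, so the subset barrier above uses a
    // different name to keep the two barriers independent.
    __syncthreads();
}

int main(void)
{
    // One block of 256 threads (not one thread with 256 bytes of shared memory).
    test<<<1, 256>>>(64);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 0;
}

Under these assumptions the program should print "complete" and pass the cudaDeviceSynchronize() check; note that, as point 3 says, on compute capability below 7.0 the barrier releases at warp granularity, while from 7.0 onwards it counts individual threads.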
