CUDA: how to use barrier.sync


Question

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar, which details the PTX synchronization instructions.


  1. It says there are 16 "barrier logical resources", and you can specify which barrier to use with the parameter "a". What is a barrier logical resource?

I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" and what "memory" does. I assume "name" replaces "%0" and "numThreads" replaces "%1", but what is "memory" and what are the colons doing?

__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
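For context while reading the answer below: in GCC-style extended asm, the colon-separated sections after the template string are output operands, input operands, and clobbers. A commented sketch of the same wrapper (behavior unchanged; the name `namedBarrierSyncAnnotated` is made up for illustration):

```cuda
__device__ __forceinline__ void namedBarrierSyncAnnotated(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;"       // PTX template; %0, %1 are operand placeholders
                 :                            // outputs: none
                 : "r"(name), "r"(numThreads) // inputs: "r" = 32-bit integer register
                 : "memory");                 // clobber: tells the compiler this statement
                                              // may read/write memory, so loads and stores
                                              // must not be reordered across the barrier
}
```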


  • In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the barrier.sync function? (For example, say I have a grid of 1 block of 256 threads. We split the block into 3 conditional branches s.t. threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel2, and threads 128 ~ 255 go into kernel3. I want the threads in kernel2 to synchronize only among themselves. So if I use the "namedBarrierSync" function defined above as "namedBarrierSync(1, 64)", does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?)

    I have tested with the code below (assume that gpuAssert is an error-checking function defined somewhere in the file).

    Here is the code:

    __global__ void test(int num_threads) 
    {
        if (threadIdx.x >= 64 && threadIdx.x < 128) 
        {
            namedBarrierSync(0, num_threads) ;
        }
        __syncthreads();
    }
    
    int main(void) 
    {
        test<<<1, 1, 256>>>(128);
        gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
        printf("complete\n");
        return 1;
    }
    


    Answer



    1. "Barrier logical resources" are the hardware necessary to synchronize threads/warps within a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them; it is sufficient to know that 16 instances of them are available.
    2. As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html.
    3. barrier.sync with a named barrier and a thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
    4. Your test launches only a single thread (with 256 bytes of shared memory allocated to it), which makes the test of the synchronization instruction moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead.
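Putting these points together, a minimal self-contained sketch of a working test might look as follows. Two assumptions go beyond the answer above: the sketch uses barrier 1 instead of 0, since __syncthreads() itself compiles to bar.sync 0, and it passes a thread count of 64, the number of threads that actually reach the named barrier (per point 3, that is the count the barrier waits for). gpuAssert here is a stand-in for the asker's error checker.

```cuda
#include <cstdio>
#include <cstdlib>

// Stand-in for the asker's gpuAssert error-checking function.
static void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}

__global__ void test(int num_threads) {
    if (threadIdx.x >= 64 && threadIdx.x < 128) {
        // Only these 64 threads reach the named barrier, so the barrier's
        // thread count must be 64, or the kernel would wait forever.
        namedBarrierSync(1, num_threads);
    }
    __syncthreads();
}

int main(void) {
    test<<<1, 256>>>(64);  // 1 block of 256 threads; no dynamic shared memory needed
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 0;
}
```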

