CUDA block synchronization


Question

I have b blocks, and each block has t threads. I can use

 __syncthreads()

to synchronize the threads within a particular block. For example:

__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        // execute something
        __syncthreads();  // wait for all threads in this block
    }
}

But my problem is to synchronize all the threads in all the blocks. How can I do this?

Answer

In CUDA 9, NVIDIA introduced the concept of cooperative groups, which allows you to synchronize all threads belonging to a group. Such a group can span all threads in the grid, so you can synchronize all threads in all blocks:

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

__global__ void aFunction()
{
    // this_grid() is callable from device code only
    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    g.sync();  // waits for every thread in every block of the grid
}
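
Note that a kernel using grid-wide sync must be launched with cudaLaunchCooperativeKernel rather than the usual <<<...>>> syntax. Below is a minimal sketch of what the host side might look like; the kernel name, grid/block sizes, and data layout are illustrative assumptions, not part of the original answer:

#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Illustrative kernel: no block passes grid.sync() until every
// block in the grid has reached it.
__global__ void stepKernel(float *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    for (int step = 0; step < 10; step++)
    {
        // ... do one phase of work on data ...
        grid.sync();  // grid-wide barrier
    }
}

int main()
{
    // Check that the device supports cooperative launches at all.
    int dev = 0, supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) return 1;

    int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    // The whole grid must be resident on the device at once; 64 blocks of
    // 256 threads is a guess here, not a computed occupancy limit.
    void *args[] = { &data, &n };
    cudaLaunchCooperativeKernel((void *)stepKernel,
                                dim3(64), dim3(256), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}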

You need Pascal (compute capability 6.0) or a newer architecture to synchronize a grid, and there are additional requirements. See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg

Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.

Source: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

Before CUDA 9, there was no native way to synchronise all threads from all blocks. In fact, the concept of blocks in CUDA is that some may be launched only after other blocks have already finished their work, for example, if the GPU it is running on is too weak to process them all in parallel.

If you ensure that you don't spawn too many blocks, you can try to synchronise all blocks with each other, e.g. by actively waiting using atomic operations. This is however slow, it eats up your GPU memory controller, it is considered "a hack", and it should be avoided.
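
For illustration only, such a hack might look like the sketch below. The counter and helper names are hypothetical, and this one-shot barrier deadlocks if more blocks are launched than can be resident on the device at the same time:

// One-shot global barrier via active waiting -- illustrative, not recommended.
__device__ unsigned int arrived = 0;

__device__ void hackyGridSync(unsigned int numBlocks)
{
    __syncthreads();  // make sure the whole block has reached the barrier
    if (threadIdx.x == 0)
    {
        atomicAdd(&arrived, 1);              // announce this block's arrival
        while (atomicAdd(&arrived, 0) < numBlocks)
            ;                                // spin until all blocks arrive
    }
    __syncthreads();  // release the rest of the block's threads
}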

So, if you don't target the Pascal (or newer) architecture, the best way I can suggest is to simply terminate your kernel at the synchronisation point and then launch a new kernel that continues the job. In most circumstances it will actually perform faster (or at least at a similar speed) than the hack mentioned above.
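
A minimal sketch of that pattern (kernel names are illustrative): split the work at the synchronisation point and rely on the fact that kernels launched on the same stream execute in order:

__global__ void phaseOne(float *data, int n) { /* work before the sync point */ }
__global__ void phaseTwo(float *data, int n) { /* work after the sync point */ }

void runBothPhases(float *data, int n)
{
    phaseOne<<<256, 128>>>(data, n);
    // No explicit synchronization is needed here: launches on the same
    // (default) stream are serialized, so phaseTwo starts only after every
    // thread of phaseOne has finished -- a de facto grid-wide barrier.
    phaseTwo<<<256, 128>>>(data, n);
}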
