cuda, dummy/implicit block synchronization


Problem description



I am aware that block sync is not possible, the only way is launching a new kernel.

BUT, let's suppose that I launch X blocks, where X corresponds to the number of SMs on my GPU. I should expect that the scheduler will assign a block to each SM... right? And if the GPU is being used as a secondary graphics card (completely dedicated to CUDA), this means that, theoretically, no other process uses it... right?

My idea is the following: implicit synchronization.

Let's suppose that sometimes I need only one block, and sometimes I need all the X blocks. Well, in those cases where I need just one block, I can configure my code so that the first block (or the first SM) will work on the "real" data while the other X-1 blocks (or SMs) on some "dummy" data, executing exactly the same instruction, just with some other offset.

So that all of them will continue to be synchronized, until I am going to need all of them again.

Is the scheduler reliable under these conditions? Or can you never be sure?

Solution

You've got several questions in one, so I'll try to address them separately.

One block per SM

I asked this a while back on nVidia's own forums (http://www.nvidia.com/content/forums/index.html#entry1426544), as I was getting results that indicated that this is not what happens. Apparently, the block scheduler will not assign a block per SM if the number of blocks is equal to the number of SMs.
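
If you want to check this on your own card, a small diagnostic kernel can record which SM each block actually ran on. This is just a sketch of my own (not part of the original answer), using the %smid PTX special register and assuming device 0:

#include <cstdio>

/* Each block writes the ID of the SM it ran on (read from the %smid
   special register) so the block-to-SM mapping can be inspected on the host. */
__global__ void record_smid ( unsigned int *smids ) {
    if ( threadIdx.x == 0 ) {
        unsigned int smid;
        asm( "mov.u32 %0, %%smid;" : "=r"(smid) );
        smids[ blockIdx.x ] = smid;
    }
}

int main ( void ) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop , 0 );
    int nblocks = prop.multiProcessorCount;   /* X = number of SMs. */

    unsigned int *d_smids;
    cudaMalloc( &d_smids , nblocks * sizeof(unsigned int) );
    record_smid<<< nblocks , 32 >>>( d_smids );

    unsigned int *h_smids = new unsigned int[ nblocks ];
    cudaMemcpy( h_smids , d_smids , nblocks * sizeof(unsigned int) , cudaMemcpyDeviceToHost );
    for ( int i = 0 ; i < nblocks ; i++ )
        printf( "block %d ran on SM %u\n" , i , h_smids[i] );

    delete[] h_smids;
    cudaFree( d_smids );
    return 0;
}

If several blocks report the same SM ID while other SMs go unused, you are seeing exactly the behaviour described above.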

Implicit synchronization

No. First of all, you cannot guarantee that each block will have its own SM (see above). Secondly, all blocks cannot access the global store at the same time. If they run synchronously at all, they will lose this synchronicity as of the first memory read/write.

Block synchronization

Now for the good news: Yes, you can. The atomic instructions described in Section B.11 of the CUDA C Programming Guide can be used to create a barrier. Assume that you have N blocks executing concurrently on your GPU.

__device__ int barrier = N;

__global__ void mykernel ( ) {

    /* Do whatever it is that this block does. */
    ...

    /* Make sure all threads in this block are actually here. */
    __syncthreads();

    /* Once we're done, decrease the value of the barrier. */
    if ( threadIdx.x == 0 )
        atomicSub( &barrier , 1 );

    /* Now wait for the barrier to be zero. */
    if ( threadIdx.x == 0 )
        while ( atomicCAS( &barrier , 0 , 0 ) != 0 );

    /* Make sure everybody has waited for the barrier. */
    __syncthreads();

    /* Carry on with whatever else you wanted to do. */
    ...

    }

The instruction atomicSub(p,i) computes *p -= i atomically and is only called by the zeroth thread in the block, i.e. we only want to decrement barrier once. The instruction atomicCAS(p,c,v) sets *p = v iff *p == c and returns the old value of *p. This part just loops until barrier reaches 0, i.e. until all blocks have crossed it.

Note that you have to wrap this part in calls to __syncthreads(), as the threads in a block do not execute in strict lock-step and you have to force them all to wait for the zeroth thread.

Just remember that if you call your kernel more than once, you should set barrier back to N.
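
For instance, a minimal host-side sketch (my addition, not part of the original answer) could reset the barrier symbol before every launch. N and BLOCK_SIZE are placeholder values here, and mykernel and barrier refer to the definitions in the snippet above:

/* Reset the __device__ variable 'barrier' to N before each launch.
   N must not exceed the number of blocks that can be resident at once. */
const int N = 8;
const int BLOCK_SIZE = 128;
int h_barrier = N;

cudaMemcpyToSymbol( barrier , &h_barrier , sizeof(int) );
mykernel<<< N , BLOCK_SIZE >>>();
cudaDeviceSynchronize();

/* Set it back to N again before the next launch. */
cudaMemcpyToSymbol( barrier , &h_barrier , sizeof(int) );
mykernel<<< N , BLOCK_SIZE >>>();
cudaDeviceSynchronize();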

Update

In reply to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to start more blocks than can be concurrently scheduled on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.
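
The occupancy calculator referred to above is a spreadsheet; as an alternative (my suggestion, not something the answer mentions), newer CUDA toolkits also expose a runtime occupancy API that reports how many blocks of a given kernel fit per SM. A sketch, reusing mykernel and BLOCK_SIZE from above:

/* Estimate how many blocks of mykernel can actually be resident at once,
   using the runtime occupancy API (available since CUDA 6.5). */
cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );

int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksPerSM , mykernel ,
                                               BLOCK_SIZE , 0 );

/* Never ask the barrier to wait for more blocks than this. */
int maxConcurrentBlocks = blocksPerSM * prop.multiProcessorCount;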

Judging by the original question, though, only as many blocks as there are SMs are being started, so this point is moot.
