Inter-block barrier on CUDA


Problem description

I want to implement an inter-block barrier on CUDA, but I am running into a serious problem.

I cannot figure out why it does not work.

#include <iostream>
#include <cstdlib>
#include <ctime>

#define SIZE 10000000
#define BLOCKS 100 

using namespace std;

struct Barrier {
    int *count;

    __device__ void wait() {
        atomicSub(count, 1);
        while(*count)
            ;
    }

    Barrier() {
        int blocks = BLOCKS;
        cudaMalloc((void**) &count, sizeof(int));
        cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
    }

    ~Barrier() {
        cudaFree(count);
    }
};


__global__ void sum(int* vec, int* cache, int *sum, Barrier barrier)
{
    int tid = blockIdx.x;

    int temp = 0;
    while(tid < SIZE) {
        temp += vec[tid];
        tid += gridDim.x;
    }

    cache[blockIdx.x] = temp;

    barrier.wait();

    if(blockIdx.x == 0) {
        for(int i = 0 ; i < BLOCKS; ++i)
            *sum += cache[i];
    }
}

int main()
{
    int* vec_host = (int *) malloc(SIZE * sizeof(int));    
    for(int i = 0; i < SIZE; ++i)
        vec_host[i] = 1;

    int *vec_dev;
    int *sum_dev;
    int *cache;
    int sum_gpu = 0;

    cudaMalloc((void**) &vec_dev, SIZE * sizeof(int));
    cudaMemcpy(vec_dev, vec_host, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &sum_dev, sizeof(int));
    cudaMemcpy(sum_dev, &sum_gpu, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &cache, BLOCKS * sizeof(int));
    cudaMemset(cache, 0, BLOCKS * sizeof(int));

    Barrier barrier;
    sum<<<BLOCKS, 1>>>(vec_dev, cache, sum_dev, barrier);

    cudaMemcpy(&sum_gpu, sum_dev, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(vec_dev);
    cudaFree(sum_dev);
    cudaFree(cache);
    free(vec_host);
    return 0;
}

In fact, even if I rewrite wait() as follows

    __device__ void wait() {
        while(*count != 234124)
            ;
    }

The program exits normally. But I expect to get an infinite loop in this case.

Recommended answer

Unfortunately, what you want to achieve (inter-block communication/synchronization) isn't strictly possible in CUDA. The CUDA programming guide states that "thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series." The reason for this restriction is to allow flexibility in the thread block scheduler, and to allow the code to agnostically scale with the number of cores. The only supported inter-block synchronization method is to launch another kernel: kernel launches (within the same stream) are implicit synchronization points.
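For a reduction like the one in the question, that means the per-block partial sums and the final accumulation can simply live in two kernels launched back to back in the same stream. The sketch below only illustrates that pattern and is not part of the original answer; the kernel names partialSums and finalSum are made up, and SIZE and BLOCKS refer to the macros defined in the question's code.

// Sketch: replace barrier.wait() with a kernel launch boundary.
__global__ void partialSums(const int* vec, int* cache)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < SIZE) {            // same grid-stride loop as in the question
        temp += vec[tid];
        tid += gridDim.x;
    }
    cache[blockIdx.x] = temp;       // one partial sum per block
}

__global__ void finalSum(const int* cache, int* sum)
{
    int total = 0;
    for (int i = 0; i < BLOCKS; ++i)  // a single thread folds the partials
        total += cache[i];
    *sum = total;
}

// Host side: two launches in the default stream; the second kernel cannot
// start until every block of the first has finished.
// partialSums<<<BLOCKS, 1>>>(vec_dev, cache);
// finalSum<<<1, 1>>>(cache, sum_dev);

Because both launches go into the same stream, the launch boundary gives exactly the "all blocks have finished phase one" guarantee the hand-rolled barrier was trying to provide.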

Your code violates the block independence rule because it implicitly assumes that your kernel's thread blocks execute concurrently (cf. in parallel). But there's no guarantee that they do. To see why this matters to your code, let's consider a hypothetical GPU with only one core. We'll also assume that you only want to launch two thread blocks. Your spinloop kernel will actually deadlock in this situation. If thread block zero is scheduled on the core first, it will loop forever when it gets to the barrier, because thread block one never has a chance to update the counter. Because thread block zero is never swapped out (thread blocks execute to their completion) it starves thread block one of the core while it spins.

Some folks have tried schemes such as yours and have seen success because the scheduler happened to serendipitously schedule blocks in such a way that the assumptions worked out. For example, there was a time when launching as many thread blocks as a GPU has SMs meant that the blocks were truly executed concurrently. But they were disappointed when a change to the driver or CUDA runtime or GPU invalidated that assumption, breaking their code.

For your application, try to find a solution which doesn't depend on inter-block synchronization, because (barring a significant change to the CUDA programming model) it just isn't possible.
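As one hedged illustration of that advice (again, not code from the original answer): for this particular sum, each block can fold its partial result into the output with atomicAdd, so no block ever has to wait for another. The kernel name sumAtomic is made up; SIZE is the macro from the question, and sum must be zeroed before the launch.

// Sketch: a barrier-free version of the reduction using atomicAdd.
__global__ void sumAtomic(const int* vec, int* sum)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < SIZE) {
        temp += vec[tid];
        tid += gridDim.x;
    }
    atomicAdd(sum, temp);   // ordering between blocks no longer matters
}

// Usage (zero the result first, e.g. with cudaMemset):
// sumAtomic<<<BLOCKS, 1>>>(vec_dev, sum_dev);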
