Benefit of splitting a big CUDA kernel and using dynamic parallelism


Problem description

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel; in this loop a certain predicate is evaluated on the current state and, depending on the result of this predicate, a certain action is taken.

The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very low.

I would like to split the kernel into many little kernels, but every block is totally independent from the others, and I (think I) can't use a single thread on the host code to launch multiple small kernels.

I am not sure if streams are adequate for this kind of work; I have never used them. But since I have the option to use dynamic parallelism, I would like to know if that is a good option for implementing this kind of job. Is it fast to launch a kernel from a kernel? Do I need to copy data into global memory to make it available to a sub-kernel?
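For reference, a device-side launch uses the same triple-angle-bracket syntax as a host-side launch. Below is a minimal sketch (hypothetical kernel names; dynamic parallelism needs a device of compute capability 3.5+ and compilation with -rdc=true). A child kernel cannot see the parent's shared or local memory, so anything it needs must live in global memory:

__global__ void childKernel(int *state, int n) {
    // Child kernels operate on global-memory data prepared by the parent
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        state[i] += 1;  // placeholder work
}

__global__ void parentKernel(int *state, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Device-side launch: same syntax as a launch from the host
        childKernel<<<(n + 255) / 256, 256>>>(state, n);
        cudaDeviceSynchronize();  // wait for the child grid to complete
    }
}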

If I split my big kernel into many little ones, leaving the first kernel with a main loop which calls the required kernel when necessary (which allows me to move the temporary variables into each sub-kernel), will that help me increase the occupancy?

I know it is a bit of a generic question, but I do not know this technology, and I would like to know if it fits my case or if streams are better.

Edit:

To provide some other details, you can imagine my kernel to have this kind of structure:

__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;

    int tid = ...;
    systemState[tid] = initialData[tid];
    if (tid == 0)
        time = 0;                        // initialize the shared clock once
    __syncthreads();

    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;
        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))         // block-wide vote: stop if any thread says so
            break;
        sample(sampleData, systemState);
        if (__syncthreads_and(...)) {    // uniform result: all threads take the same branch
            do_something(systemState);
            __syncthreads();
            if (tid == 0)
                time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            __syncthreads();
            do_something_else(someTemp, systemState);
            if (tid == 0)
                time += some_other_increment(someTemp, systemState);
        }
        __syncthreads();                 // make the updated time visible to all threads
    }
    do_some_stats();
}

This is to show you that there is a main loop, that there are temporary data used in some places and not in others, and that there are shared data, synchronization points, etc.

Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not true, but logically it is)... one "big flow" for each block.

Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host, I guess... But how do I coordinate all the blocks from a single loop? This is what leaves me most dubious. May I use streams from different host threads (one thread per block)?
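For comparison, a single host thread can drive many streams; there is no need for one host thread per block. A minimal sketch (smallKernel, d_state, NUM_TASKS and N are hypothetical names), in which each independent task gets its own stream so the small launches can overlap on the device:

const int NUM_TASKS = 16;              // one independent task per former block
cudaStream_t streams[NUM_TASKS];
for (int s = 0; s < NUM_TASKS; ++s)
    cudaStreamCreate(&streams[s]);

for (int s = 0; s < NUM_TASKS; ++s) {
    // Launches are asynchronous; kernels in different streams may run concurrently
    smallKernel<<<1, N, 0, streams[s]>>>(d_state + s * N);
}

for (int s = 0; s < NUM_TASKS; ++s) {
    cudaStreamSynchronize(streams[s]); // wait for each task to finish
    cudaStreamDestroy(streams[s]);
}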

I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure whether I would actually gain an advantage here.

Recommended answer

I have benefited from dynamic parallelism for solving an interpolation problem of the form:

int i = threadIdx.x + blockDim.x * blockIdx.x;

for (int m = 0; m < (2 * K + 1); m++) {

    PP1 = calculate_PP1(i, m);
    phi_cap1 = calculate_phi_cap1(i, m);

    for (int n = 0; n < (2 * K + 1); n++) {

        PP2 = calculate_PP2(i, n);
        phi_cap2 = calculate_phi_cap2(i, n);

        atomicAdd(&result[PP1][PP2], data[i] * phi_cap1 * phi_cap2);
    }
}

where K = 6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them into a (2K+1) x (2K+1) kernel.
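A sketch of that split (hypothetical signatures: it assumes the calculate_* helpers are __device__ functions, that result is flattened row-major with width W, and K = 6 as above). The parent thread for sample i launches a child grid in which each of the (2K+1) x (2K+1) threads computes one addend:

__global__ void childAddends(int i, const float *data, float *result, int W) {
    int m = threadIdx.x;                   // 0 .. 2K
    int n = threadIdx.y;                   // 0 .. 2K
    int PP1 = calculate_PP1(i, m);
    int PP2 = calculate_PP2(i, n);
    float phi_cap1 = calculate_phi_cap1(i, m);
    float phi_cap2 = calculate_phi_cap2(i, n);
    atomicAdd(&result[PP1 * W + PP2], data[i] * phi_cap1 * phi_cap2);
}

__global__ void parentInterp(const float *data, float *result, int W, int len) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < len) {
        dim3 child(2 * K + 1, 2 * K + 1);  // one child thread per addend
        childAddends<<<1, child>>>(i, data, result, W);
    }
}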

From my (possibly incomplete) experience, dynamic parallelism will help if you have a small number of independent iterations. For a larger number of iterations, you might end up calling the child kernel many times, so you should check whether the kernel-launch overhead becomes the limiting factor.
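One simple way to check is to time the whole parent kernel with CUDA events from the host and compare it against a version that does the same work in-line without child launches. A host-side sketch (reusing the hypothetical parentInterp above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
parentInterp<<<blocks, threads>>>(d_data, d_result, W, len);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time including child launches
printf("parent + children: %.3f ms\n", ms);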

