What is happening with this CUDA code that returns this unexpected output?


Problem description


Having finally gotten Dynamic Parallelism up and running, I'm now trying to implement my model with it. It took me a while to figure out that some strange output resulted from needing to use cudaDeviceSynchronize() to make the parent kernel wait for the child kernels to finish.

It seems there is something wrong with the device function I defined as arrAdd. Here's a table of outputs before and after each child kernel in the k2 parent kernel.

Initially    : k1   = { -1   0   0   0   0 }
Post arrInit : temp = { .25 .25 .25 .25 .25}
Post arrMult : temp = {-.25  0   0   0   0 }
post arrAdd  : temp = { -8   0   0   0   0 }
Expected     : temp = {-.50  0   0   0   0 }


__global__ void k2(double* concs, int* maxlength, double* k1s, double* k2s, double * temp, double* tempsum)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double a21 = .25;

    arrInit<<< 1, *maxlength >>>(temp, a21);                //temp = a21
    cudaDeviceSynchronize();
    arrMult<<< 1, *maxlength >>>(k1s, temp, temp);          //temp = a21*k1
    cudaDeviceSynchronize();
    arrAdd<<< 1, *maxlength >>>(temp, temp, temp);          //temp = 2*a21*k1
    cudaDeviceSynchronize();
}

__global__ void arrAdd(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]+b[idx];
}
__global__ void arrMult(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]*b[idx];
}
__global__ void arrInit(double* a, double b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b;
}

Solution

You probably don't need to synchronize with the child kernels inside the parent. Child kernels launched by the same parent thread execute in the order they were launched, and the end of the parent kernel is an implicit synchronization point with the last child kernel.
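Given that ordering guarantee, the intermediate cudaDeviceSynchronize() calls in k2 are not needed just to order the child launches. A sketch of the same kernel body without them (assuming the arrInit/arrMult/arrAdd kernels from the question; unused parameters kept for signature compatibility):

```cuda
__global__ void k2(double* concs, int* maxlength, double* k1s, double* k2s,
                   double* temp, double* tempsum)
{
    double a21 = .25;

    // Child kernels launched by the same parent thread run in launch order,
    // so each launch below sees the results of the previous one.
    arrInit<<< 1, *maxlength >>>(temp, a21);        // temp = a21
    arrMult<<< 1, *maxlength >>>(k1s, temp, temp);  // temp = a21*k1
    arrAdd<<< 1, *maxlength >>>(temp, temp, temp);  // temp = 2*a21*k1

    // The end of the parent kernel is an implicit synchronization point
    // with the last child, so no cudaDeviceSynchronize() is required here.
}
```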

When you use dynamic parallelism, be careful about these items:

  1. The maximum nesting depth is 24 (on CC 3.5).

  2. The number of dynamic kernel launches that can be pending at the same time is limited (2048 by default on CC 3.5), but it can be increased.

  3. Keep the parent kernel busy after a child kernel launch; otherwise there is a good chance you are wasting resources.

I guess your strange wrong results originate from the second factor mentioned above. When you hit the limit, some of the dynamic kernels simply don't run, and if you don't check for errors you won't notice, because the error-reporting mechanism is per thread.
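Because launch failures are reported per launching thread, the parent kernel itself has to check for them; cudaGetLastError() and cudaGetErrorString() are available from the device runtime. A minimal sketch of such a check after one child launch (the kernel name k2_checked and its shape are illustrative, not the poster's code):

```cuda
#include <cstdio>

__global__ void k2_checked(double* temp, int n)
{
    const double a21 = .25;
    arrInit<<< 1, n >>>(temp, a21);

    // A child grid that could not be queued (e.g. because the
    // pending-launch limit was hit) reports its failure here,
    // and only in the thread that issued the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("child launch failed: %s\n", cudaGetErrorString(err));

    cudaDeviceSynchronize();
}
```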

You can increase this limit with cudaDeviceSetLimit(), passing cudaLimitDevRuntimePendingLaunchCount as the limit. But the more you reserve, the more global memory space you consume. Have a look at section C.4.3.1.3 of the documentation here.
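On the host, the limit is raised before launching the parent kernel; a sketch (the value 32768 is just an illustration, and larger values reserve more global memory):

```cuda
#include <cuda_runtime.h>

void raisePendingLaunchLimit()
{
    // Default is 2048 on CC 3.5; each reserved launch slot costs global memory.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

    // Read the limit back to see what the runtime actually granted.
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
}
```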
