Nested kernels in CUDA

Views: 225 | Published: 2016/6/1 19:47:29 | Tags: arrays, cuda

Question

CUDA currently does not allow nested kernels.

To be specific, I have the following problem: I have N data points, each of them M-dimensional. To process each of the N data points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot create a kernel that calls the three kernels. Therefore, I have to process each data point serially.

One solution is to write one big kernel containing the functionality of all three kernels, but I think it would be sub-optimal.

Can anyone suggest how streams can be used to process the N data points in parallel while retaining the three smaller kernels?

Thanks.

Recommended answer

Well, if you want to use streams... you will want to create N streams:

// One stream per data point; streams must be a pointer so it can hold N handles.
cudaStream_t *streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for (int i = 0; i < N; i++)
{
    cudaStreamCreate(&streams[i]);
}

Then, for the i-th data point, you want to use cudaMemcpyAsync for the transfers:

// Argument order is (dst, src, count, kind, stream).
cudaMemcpyAsync(dst, src, count, kind, streams[i]);
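One detail that matters for the transfers: cudaMemcpyAsync can only overlap with other work when the host buffer is page-locked (pinned). The sketch below is a minimal round trip through one stream, assuming a hypothetical buffer of M floats; the names h_buf and d_buf and the size are illustrative and do not come from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t M = 1024;                      // hypothetical number of floats per data point
    const size_t bytes = M * sizeof(float);

    float *h_buf = nullptr;                     // page-locked host buffer
    float *d_buf = nullptr;                     // device buffer
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, bytes);     // pinned allocation instead of malloc
    cudaMalloc((void **)&d_buf, bytes);
    cudaStreamCreate(&stream);

    for (size_t j = 0; j < M; j++)
        h_buf[j] = (float)j;

    // Argument order is (dst, src, count, kind, stream).
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // The copies may still be in flight here; wait before reading h_buf.
    cudaStreamSynchronize(stream);
    printf("h_buf[10] = %f\n", h_buf[10]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);                        // pinned memory is released with cudaFreeHost
    return 0;
}
```

With a buffer obtained from plain malloc the call still works, but the copy is staged through an internal buffer and generally does not overlap with kernels in other streams.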

and launch your kernels with all four configuration parameters (sharedMemory can be 0, of course):

kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );

and to clean up:

for (int i = 0; i < N; i++)
{
    cudaStreamDestroy(streams[i]);
}
free(streams);
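Putting the pieces together, the whole approach might look roughly like the sketch below. It is only an illustration under stated assumptions: the three kernel bodies are placeholders (the question does not show the real ones), N and M are made-up sizes, and all data points are assumed to sit in one contiguous buffer. Work issued to a single stream runs in issue order, so the three kernels for one data point stay sequential, while different streams are free to overlap if the device has spare resources.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder stages standing in for the three kernels from the question.
__global__ void kernel_1(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] += 1.0f;
}

__global__ void kernel_2(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] *= 2.0f;
}

__global__ void kernel_3(float *x, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m) x[j] -= 3.0f;
}

int main()
{
    const int N = 8;                            // number of data points (hypothetical)
    const int M = 1 << 16;                      // dimensions per data point (hypothetical)
    const size_t bytes = (size_t)M * sizeof(float);
    const int nThreads = 256;
    const int nBlocks = (M + nThreads - 1) / nThreads;

    float *h_data = nullptr, *d_data = nullptr;
    cudaMallocHost((void **)&h_data, N * bytes);    // pinned, so async copies can overlap
    cudaMalloc((void **)&d_data, N * bytes);
    for (size_t k = 0; k < (size_t)N * M; k++)
        h_data[k] = 2.0f;

    cudaStream_t *streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
    for (int i = 0; i < N; i++)
        cudaStreamCreate(&streams[i]);

    // Queue copy-in, three kernels, and copy-out for each data point in its own stream.
    for (int i = 0; i < N; i++) {
        float *h = h_data + (size_t)i * M;
        float *d = d_data + (size_t)i * M;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel_1<<<nBlocks, nThreads, 0, streams[i]>>>(d, M);
        kernel_2<<<nBlocks, nThreads, 0, streams[i]>>>(d, M);
        kernel_3<<<nBlocks, nThreads, 0, streams[i]>>>(d, M);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for every stream before reading results on the host.
    cudaDeviceSynchronize();

    // Each element should now be (2 + 1) * 2 - 3 = 3.
    int errors = 0;
    for (size_t k = 0; k < (size_t)N * M; k++)
        if (h_data[k] != 3.0f) errors++;
    printf("mismatches: %d\n", errors);

    for (int i = 0; i < N; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return errors != 0;
}
```

No explicit synchronization is needed between kernel_1, kernel_2, and kernel_3 here, because they share a stream. How much the N streams actually overlap depends on the GPU and on how many blocks each launch occupies, so it is worth profiling before assuming a speedup.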
