CUDA: Does passing arguments to a kernel slow the kernel launch much?

Question

CUDA beginner here.

In my code I am currently launching kernels many times in a loop in the host code, because I need synchronization between blocks. So I wondered whether I could optimize the kernel launch.

My kernel launches look something like this:

MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x);

So to launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering whether passing the arguments makes this process noticeably slower.

The arguments to the kernel are the same every single time, so perhaps I could save time by copying them once and accessing them in the kernel through a name defined by

__device__ int N;
<and somehow (how?) copy the value to this name N on the GPU once>

and then simply launch the kernel with no arguments, like so:

MyKernel<<<blocks,threadsperblock>>>();

Will this make my program any faster? What is the best way of doing this? AFAIK the arguments are stored in some constant global memory. How can I make sure that the manually transferred values are stored in memory that is as fast or faster?

Thanks for any help.

Answer

I would expect the benefits of such an optimization to be rather small. On sane platforms (i.e. anything other than WDDM), kernel launch overhead is only on the order of 10-20 microseconds, so there probably isn't a lot of scope for improvement.
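
If you want a rough feel for the launch overhead on your own platform, one crude way is to time a large number of back-to-back launches of an empty kernel and average. The sketch below is illustrative only (it is not from the original answer) and measures launch throughput rather than the latency of a single isolated launch:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}   // does nothing; any time measured is launch cost

int main()
{
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    EmptyKernel<<<1, 1>>>();       // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch time: %.2f us\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}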

Having said that, if you want to try, the logical way to affect this is using constant memory. Define each argument as a __constant__ symbol at translation unit scope, then use the cudaMemcpyToSymbol function to copy values from the host to device constant memory.
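
For illustration, here is a minimal sketch of that approach, assuming a kernel that scales an input array by a scalar; the symbol names (d_in, d_out, d_N, d_x), the kernel body, and the launch configuration are placeholders rather than code from the question:

#include <cuda_runtime.h>

// Each former kernel argument becomes a __constant__ symbol at file scope.
__constant__ double *d_in;    // device pointer, set once from the host
__constant__ double *d_out;
__constant__ int     d_N;
__constant__ double  d_x;

__global__ void MyKernel()    // no arguments: everything is read from constant memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_N)
        d_out[i] = d_x * d_in[i];   // placeholder computation
}

int main()
{
    const int N = 1 << 20;
    const double x = 2.0;
    double *in = 0, *out = 0;
    cudaMalloc(&in,  N * sizeof(double));
    cudaMalloc(&out, N * sizeof(double));

    // Copy each value to its __constant__ symbol once, before the launch loop.
    cudaMemcpyToSymbol(d_in,  &in,  sizeof(in));
    cudaMemcpyToSymbol(d_out, &out, sizeof(out));
    cudaMemcpyToSymbol(d_N,   &N,   sizeof(N));
    cudaMemcpyToSymbol(d_x,   &x,   sizeof(x));

    dim3 threadsperblock(256);
    dim3 blocks((N + threadsperblock.x - 1) / threadsperblock.x);

    for (int iter = 0; iter < 100; ++iter)
        MyKernel<<<blocks, threadsperblock>>>();   // launch with no arguments

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Note that, as the question itself observes, ordinary kernel arguments are already passed through constant memory, so reads of these symbols should be comparably fast; any saving comes from not re-marshalling the same argument values on every launch in the loop.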
