CUDA variables inside global kernel

Question

My questions are:

1) Did I understand correctly that when you declare a variable in a global kernel, each thread gets its own copy of that variable? That lets you store an intermediate result in it for every thread. Example: vector c = a + b:

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;   // this thread's element index
    int p;                 // per-thread intermediate value
    p = a[i] + b[i];
    c[i] = p;
}

Here we declare the intermediate variable p. But in reality there are N copies of this variable, one per thread.

2) Is it true that if I declare an array, N copies of that array will be created, one for each thread? And since everything inside the global kernel happens in GPU memory, do I need N times more GPU memory for any declared variable, where N is the number of threads?

3) In my current program I have 35*48 = 1680 blocks, and each block contains 32*32 = 1024 threads. Does that mean any variable declared inside a global kernel will cost me N = 1024*1680 = 1,720,320 times more memory than outside the kernel?

4) To use shared memory, do I need M times more memory for each variable than usual, where M is the number of blocks? Is that true?

Answer

1) Yes. Each thread has a private copy of every non-shared variable declared in the function. These usually live in GPU registers, though they can spill into local memory.
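
A minimal sketch of the difference (the kernel name and sizes here are made up for illustration). Compiling with nvcc -Xptxas -v prints the per-thread register and local-memory usage the compiler actually chose:

__global__ void privateStorage(const int *a, int *out)
{
    int i = threadIdx.x;

    int p = a[i] * 2;       // scalar: almost certainly kept in a register

    int buf[64];            // per-thread array: indexed dynamically below,
                            // so it is likely to spill into local memory
    for (int k = 0; k < 64; ++k)
        buf[k] = p + k;

    out[i] = buf[i % 64];   // dynamic index prevents full register promotion
}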

2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to hold enough private memory for every thread at once. In hardware, not all threads need to execute simultaneously. For example, if you launch N threads, half of them may be active at a given time while the other half won't start until resources free up.
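
One way to see how many of your blocks can actually be resident at once is the CUDA occupancy API. A sketch, reusing the addKernel from the question on device 0:

#include <cstdio>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    // How many 1024-thread blocks of addKernel fit on one SM,
    // given its register and shared-memory usage
    int numBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSM, addKernel, 1024, 0 /* dynamic shared mem */);

    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    // Blocks beyond this limit simply wait until earlier blocks retire
    printf("blocks resident at once: %d of 1680 launched\n",
           numBlocksPerSM * smCount);
    return 0;
}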

The more resources your threads use, the fewer of them the hardware can run simultaneously, but that doesn't limit how many you can ask to run: any threads the GPU doesn't currently have resources for are simply run once resources free up.

This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it runs threads in parallel, and to do that it needs to keep many threads resident at any given time. In a very general sense, the more resources you use per thread, the fewer threads are active at a given moment, and the less parallelism the hardware can exploit.
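
On question 4 specifically: a __shared__ variable is allocated once per block (so M copies in total, one per block), not once per thread. A sketch of the same vector add staged through shared memory, assuming a fixed block size of 256:

#define BLOCK 256

__global__ void addKernelShared(int *c, const int *a, const int *b)
{
    // One copy of this array per *block*, shared by its 256 threads
    __shared__ int s[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = a[i] + b[i];
    __syncthreads();   // make the block's writes visible to all its threads

    c[i] = s[threadIdx.x];
}

Staging through shared memory buys nothing for a plain element-wise add; it only pays off when threads in a block reuse each other's data.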
