Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

Question

How are threads organized to be executed by a GPU?

Answer

Hardware

If a GPU device has, for example, 4 multiprocessing units and each can run 768 threads, then at any given moment no more than 4*768 threads will really be running in parallel (if you schedule more threads, they will wait their turn).

Threads are organized in blocks. A block is executed by a multiprocessing unit. The threads of a block can be identified (indexed) using a 1-dimensional (x), 2-dimensional (x,y) or 3-dimensional (x,y,z) index, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device capability).

Obviously, if you need more than those 4*768 threads you need more than 4 blocks. Blocks may also be indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously).

Suppose we have a 512x512 image and we want one thread to process one pixel (i,j).

We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so as to have 512x512 threads = 4096*64).

It's common to organize the threads in 2D blocks with blockDim = 8 x 8 (the 64 threads per block), which makes indexing the image easier. I prefer to call it threadsPerBlock.

dim3 threadsPerBlock(8, 8);  // 64 threads

and a 2D gridDim of 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.

dim3 numBlocks(imageWidth/threadsPerBlock.x,  /* for instance 512/8 = 64*/
              imageHeight/threadsPerBlock.y); 

The kernel is launched like this:

myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );       

Finally: there will be something like a "queue of 4096 blocks", where a block waits to be assigned one of the GPU's multiprocessors to get its 64 threads executed.

In the kernel, the pixel (i,j) to be processed by a thread is calculated this way:

uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
