Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
Question
How are threads organized to be executed by a GPU?
Answer
Hardware

If a GPU device has, for example, 4 multiprocessing units, and each can run 768 threads, then at any given moment no more than 4*768 threads will really be running in parallel (if you planned more threads, they will be waiting their turn).
Threads are organized in blocks. A block is executed by a multiprocessing unit. The threads of a block can be identified (indexed) using a 1D index (x), 2D indexes (x, y) or 3D indexes (x, y, z), but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device's capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks. Blocks may also be indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are executed simultaneously).
Suppose we want one thread to process one pixel (i, j) of a 512x512 image.
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so as to have 512x512 threads = 4096*64).
It's common to organize the threads in 2D blocks (to make indexing the image easier) with blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and a 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel, the pixel (i, j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;