CUDA determining threads per block, blocks per grid


Question

我是 CUDA 范式的新手.我的问题是确定每个块的线程数和每个网格的块数.有一点艺术和试验的作用吗?我发现很多例子都为这些东西选择了看似任意的数字.

I'm new to the CUDA paradigm. My question is in determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary number chosen for these things.

I'm considering a problem where I would be able to pass matrices of any size to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block and blocks/grid in this case?
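For concreteness, the setup being described is roughly the following sketch (the kernel and launcher names are illustrative): a fixed 2D block size, a grid computed with ceiling division so matrices of any size are covered, and a bounds check inside the kernel for threads that fall outside C.

```cuda
#include <cuda_runtime.h>

// Sketch: one thread computes one element of C = A * B, where A is MxK,
// B is KxN, and C is MxN, all row-major. Names are illustrative.
__global__ void matMul(const float* A, const float* B, float* C,
                       int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {  // guard: the grid may overshoot the matrix edges
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Host launcher: 16x16 = 256 threads per block (a multiple of 32), with the
// grid rounded up so every element of C is covered, whatever M and N are.
void launchMatMul(const float* dA, const float* dB, float* dC,
                  int M, int N, int K)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x,
              (M + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, M, N, K);
}
```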

Answer

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.

A CUDA enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, shared memory, a maximum number of active blocks, and a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
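If you'd rather query these limits at runtime than look them up in a table, the CUDA runtime exposes them through cudaGetDeviceProperties; a minimal sketch for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the per-SM limits discussed above for device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("32-bit regs per SM:    %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```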

First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
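In code terms, if the block size comes from user input or tuning, it's cheap to round it up to a whole number of warps with ceiling division; a sketch (the helper name is made up):

```cuda
// Sketch: round a requested block size up to a whole number of warps
// (32 threads), so no partially filled warp is issued.
const int WARP_SIZE = 32;

int roundUpToWarp(int threads)
{
    return ((threads + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}
// roundUpToWarp(50) == 64: a 50-thread block occupies two full warps anyway,
// so you may as well ask for 64 and put the extra 14 threads to work.
```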

Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means that with 128 threads per block, you could fit 16 blocks in an SM before hitting the 2048-thread limit. With 256 threads, you can only fit 8 blocks, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block leaves only 1024 threads in use when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
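If you'd rather not do this arithmetic by hand, the CUDA runtime's occupancy API can do it for a concrete kernel. A sketch, assuming a trivial placeholder kernel; the real numbers depend on your kernel's register and shared-memory usage:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out) { out[threadIdx.x] = 0.0f; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // For each candidate block size, ask how many blocks of this kernel can
    // be resident on one SM, then convert that to an occupancy percentage.
    for (int blockSize = 64; blockSize <= 256; blockSize *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummyKernel, blockSize, /*dynamicSMem=*/0);
        int activeThreads = blocksPerSM * blockSize;
        printf("block=%3d -> %2d blocks/SM, %4d threads/SM (%.0f%% occupancy)\n",
               blockSize, blocksPerSM, activeThreads,
               100.0 * activeThreads / prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```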

On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.

Regarding registers and shared memory, look at those next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the Wikipedia page. Good luck!
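To see those numbers for your own kernel, you can compile with nvcc -Xptxas -v (which prints register and shared-memory usage at build time), or query them at runtime with cudaFuncGetAttributes; a minimal sketch with a placeholder kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out) { out[threadIdx.x] = 0.0f; }

int main()
{
    // Query the per-thread register count and static shared memory the
    // kernel actually uses, to plug into the occupancy reasoning above.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```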
