CUDA determining threads per block, blocks per grid


Question

I'm new to the CUDA paradigm. My question is in determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary numbers chosen for these things.

I'm considering a problem where I would be able to pass matrices of any size to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block and blocks/grid in this case?

Answer

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.

A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, a finite amount of shared memory, a maximum number of active blocks, and a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.

First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (groups of 32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads, and the extra 14 would simply be wasted.

Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means that with 128 threads per block, you could fit 16 blocks in an SM before hitting the 2048-thread limit. If you use 256 threads, you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block will only use 1024 threads when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).

On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and the remaining blocks are placed into a pipeline. Blocks are moved into an SM for processing as soon as that SM has enough free resources to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that smaller blocks (128 instead of 256 in the previous example) may complete faster, since a particularly slow block will hog fewer resources, but this is very much dependent on the code.

Regarding registers and shared memory, look at those next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on the SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the Wikipedia page. Good luck!

