How do I choose grid and block dimensions for CUDA kernels?


Question

This is a question about how to determine the CUDA grid, block and thread sizes. It is a follow-up to the question posted here.

Following that link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".

I haven't found a good explanation or clarification of this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (number of threads) given the following code:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work
mAdd<<<nblocks, blocksize>>>(A, B, C, n);
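Filled out into a complete, compilable sketch (the `mAdd` kernel body and `float` element type are assumptions, since the question doesn't show them), with ceil division so the grid still covers every element when `n` is not an exact multiple of `blocksize`:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise add kernel matching the launch in the question.
__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the last block may overshoot n
        C[i] = A[i] + B[i];
}

void launchAdd(const float* A, const float* B, float* C, int n)
{
    int blocksize = 512;                            // tuned empirically
    int nblocks = (n + blocksize - 1) / blocksize;  // ceil division covers any n
    mAdd<<<nblocks, blocksize>>>(A, B, C, n);
}
```

For n = 128 * 1024 this launches exactly 256 blocks, same as the integer division in the question; the ceil form only matters when n stops being a power of two.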

Answer

There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.

This is the easy-to-quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits on how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

  1. Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
  2. The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
  3. Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
  4. Each block cannot consume more than 16KB/48KB/96KB of shared memory (Compute 1.x/2.x-6.2/7.0)
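Rather than memorizing the per-compute-capability table, the same limits can be read off the device at runtime; a minimal sketch using `cudaGetDeviceProperties` (querying device 0, error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the per-block hard limits of device 0 instead of hard-coding
// per-compute-capability numbers.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:          [%d, %d, %d]\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```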

If you stay within those limits, any kernel you can successfully compile will launch without error.

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. How each code behaves will be different, and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

  1. The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
  2. Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl's answer is referring to).
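As a starting point for the occupancy-driven approach, the CUDA runtime (6.5 and later) can suggest a block size for a given kernel; a sketch using `cudaOccupancyMaxPotentialBlockSize` (the `mAdd` kernel body here is an assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes theoretical
    // occupancy for this kernel, given its register/shared-memory usage.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mAdd, 0, 0);
    printf("Suggested block size: %d (min grid size for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

Note this maximizes theoretical occupancy, which is a heuristic, not a guarantee of best runtime; it is a reasonable default to benchmark against.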

The second point is a huge topic which I doubt anyone is going to try to cover in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkeley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have an impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where it is. The good news is that because you are working in multiples of the warp size, the search space is small, and the best configuration for a given piece of code is relatively easy to find.
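That search can be mechanized. A minimal benchmarking sweep over warp-multiple block sizes, timed with CUDA events, might look like this (error checking omitted; the `mAdd` kernel is an assumption, and real measurements should average many launches after a warm-up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    const int n = 128 * 1024;
    float *A, *B, *C;
    cudaMalloc(&A, n * sizeof(float));
    cudaMalloc(&B, n * sizeof(float));
    cudaMalloc(&C, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep candidate block sizes in warp-size multiples and time each launch.
    for (int bs = 32; bs <= 512; bs *= 2) {
        int nblocks = (n + bs - 1) / bs;  // ceil division covers any n
        cudaEventRecord(start);
        mAdd<<<nblocks, bs>>>(A, B, C, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blocksize %3d: %.4f ms\n", bs, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```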
