How do I choose grid and block dimensions for CUDA kernels?


Question

This is a question about how to determine the CUDA grid, block and thread sizes. This is an additional question to the one posted here:

http://stackoverflow.com/a/5643838/1292251

Following this link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".

I haven't found a good explanation or clarification of this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (= number of threads) given the following code:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work
mAdd<<<nblocks, blocksize>>>(A, B, C, n);
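For context, here is a minimal self-contained sketch of what that launch might look like, assuming mAdd is a simple element-wise vector add. The kernel body, the helper name, and the ceiling division are illustrative additions, not part of the original question:

#include <cuda_runtime.h>

// Hypothetical element-wise add kernel matching the launch above.
__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against the last, partially filled block
        C[i] = A[i] + B[i];
}

void launchAdd(const float* A, const float* B, float* C)
{
    const int n = 128 * 1024;
    int blocksize = 512;                            // value chosen by tuning
    int nblocks = (n + blocksize - 1) / blocksize;  // round up so all n elements are covered
    mAdd<<<nblocks, blocksize>>>(A, B, C, n);
}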

BTW, I started my question with the link above because it partly answers my first question. If this is not a proper way to ask questions on Stack Overflow, please excuse or advise me.

Answer

There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.

This is the easy-to-quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits on how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as follows (a sketch for querying these limits at runtime appears after the list):


  1. Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x-3.x respectively)
  2. The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x)
  3. Each block cannot consume more than 8k/16k/32k registers total (Compute 1.0,1.1 / 1.2,1.3 / 2.x)
  4. Each block cannot consume more than 16kb/48kb of shared memory (Compute 1.x/2.x)
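These limits are also queryable at runtime. The following sketch (an addition for illustration, not part of the original answer) prints the relevant cudaDeviceProp fields for device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:    [%d, %d, %d]\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}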

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. How each code behaves will be different, and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:


  1. The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
  2. Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latencies of the architecture and achieve maximum throughput. The orthodox approach here is to try to achieve optimal hardware occupancy (what Roger Dahl's answer refers to; see the occupancy sketch after this list).
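As an aside that postdates the original answer: newer CUDA toolkits (6.5 onwards) expose an occupancy API that automates part of this choice. A minimal sketch, reusing the hypothetical mAdd kernel from above:

#include <cuda_runtime.h>

__global__ void mAdd(const float* A, const float* B, float* C, int n); // defined elsewhere

void pickBlockSize(int n)
{
    int minGridSize = 0; // minimum grid size needed to reach full occupancy
    int blockSize = 0;   // block size suggested by the runtime

    // Ask the runtime which block size maximizes theoretical occupancy for
    // this kernel (assuming 0 bytes of dynamic shared memory, no size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mAdd, 0, 0);

    int nblocks = (n + blockSize - 1) / blockSize; // cover all n elements
    // mAdd<<<nblocks, blockSize>>>(A, B, C, n);
}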

The second point is a huge topic which I doubt anyone is going to try and cover in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkeley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have an impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite, and the best configuration for a given piece of code is relatively easy to find.
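A brute-force version of that search might look like the sketch below. The harness and names are hypothetical additions; cudaEvent timing is one common way to measure kernel runtime:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mAdd(const float* A, const float* B, float* C, int n); // defined elsewhere

// Time one launch configuration with CUDA events; returns milliseconds.
float timeKernel(const float* A, const float* B, float* C, int n, int blocksize)
{
    int nblocks = (n + blocksize - 1) / blocksize;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mAdd<<<nblocks, blocksize>>>(A, B, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

void sweep(const float* A, const float* B, float* C, int n)
{
    // Only multiples of the warp size (32) are worth testing.
    for (int blocksize = 32; blocksize <= 512; blocksize += 32)
        printf("blocksize %3d: %.3f ms\n", blocksize,
               timeKernel(A, B, C, n, blocksize));
}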

