How does an SM in CUDA run multiple blocks simultaneously?


Problem Description

In CUDA, can an SM run multiple blocks simultaneously if each block doesn't use too many resources?

On Fermi, we know that an SM has 32 KB of register space available. Suppose a thread uses 32 registers; then this SM can launch one block containing 256 ((32 * 1024) / (32 * 4)) threads. If an SM can run multiple blocks simultaneously, we could also configure 32 threads per block and 8 blocks for the SM. Is there any difference?
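
To make the question's arithmetic explicit, here is a minimal sketch that simply encodes the figures quoted above (a 32 KB register file, 4-byte registers, 32 registers per thread); as the answer below notes, these numbers are not entirely correct for Fermi, so treat this as a restatement of the question rather than a hardware reference.

    #include <cstdio>

    int main()
    {
        // Figures taken from the question, not from a hardware datasheet.
        const int register_file_bytes  = 32 * 1024;  // "32 KB of register space"
        const int bytes_per_register   = 4;
        const int registers_per_thread = 32;

        // Threads per SM if registers were the only limit:
        // (32 * 1024) / (32 * 4) = 256, the figure in the question.
        int threads_by_registers =
            register_file_bytes / (registers_per_thread * bytes_per_register);

        std::printf("Register-limited threads per SM: %d\n", threads_by_registers);
        return 0;
    }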

Solution

As @talonmies commented, your math is not entirely correct. But the key point is that an SM contains a balance of many different types of resources. The better your kernel and kernel launch parameters fit with this balance, the better your performance.

I haven't checked the numbers for Kepler (compute capability 3.x), but for Fermi (2.x), an SM can keep track of 48 concurrent warps (1,536 threads) and 8 concurrent blocks. This means that if you choose a low thread count for your blocks, the limit of 8 concurrent blocks becomes the factor limiting your kernel's occupancy. For instance, if you choose 32 threads per block, you get at most 256 (8 * 32) concurrent threads running on the SM, while the SM can run up to 1,536 threads (48 * 32).
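
As a rough sketch of that limiting-factor reasoning, the following host-side snippet hard-codes the two Fermi limits quoted above (8 resident blocks and 48 warps, i.e. 1,536 threads, per SM) and ignores register and shared-memory limits; it is purely illustrative, not a replacement for the occupancy calculator.

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        // Fermi (compute capability 2.x) per-SM limits quoted in the answer.
        const int max_blocks_per_sm  = 8;
        const int max_threads_per_sm = 1536;   // 48 warps * 32 threads per warp

        // Resident threads per SM for a few block sizes, assuming nothing
        // else (registers, shared memory) limits occupancy first.
        for (int block_size : {32, 64, 128, 192, 256}) {
            int resident = std::min(max_blocks_per_sm * block_size,
                                    max_threads_per_sm);
            std::printf("block size %3d -> %4d resident threads (%5.1f%% occupancy)\n",
                        block_size, resident,
                        100.0 * resident / max_threads_per_sm);
        }
        return 0;
    }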

In the occupancy calculator, you can see what the different hardware limits are and it will tell you which of them becomes the limiting factor with your specific kernel. You can experiment with variations in launch parameters, shared memory usage and register usage to see how they affect your occupancy.
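
The occupancy calculator the answer refers to is a spreadsheet, but newer CUDA toolkits (6.5 and later) expose the same calculation through the runtime API. A minimal sketch, assuming such a toolkit and a made-up kernel named myKernel whose compiled register and shared-memory usage is what the query takes into account:

    #include <cstdio>
    #include <cuda_runtime.h>

    // A placeholder kernel; occupancy depends on its register and
    // shared-memory usage as reported by the compiler.
    __global__ void myKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const int    block_size   = 256;  // launch parameter to evaluate
        const size_t dynamic_smem = 0;    // no dynamic shared memory

        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, myKernel, block_size, dynamic_smem);

        int active_warps = blocks_per_sm * block_size / prop.warpSize;
        int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

        std::printf("Blocks per SM: %d, theoretical occupancy: %.0f%%\n",
                    blocks_per_sm, 100.0 * active_warps / max_warps);
        return 0;
    }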

Occupancy is not everything when it comes to performance. Increased occupancy translates to increased ability to hide the latency of memory transfers. When the memory bandwidth is saturated, increasing occupancy further does not help. There is another effect in play as well. Increasing the size of a block may decrease occupancy but at the same time increase the amount of instruction level parallelism (ILP) available in your kernel. In this case, decreasing occupancy can increase performance.
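
To make the ILP point concrete: the answer frames it in terms of block size, but the same effect is easiest to show with the amount of independent work per thread. Below is a hypothetical sketch (not from the original answer) with two kernel variants; the second gives each thread four independent elements whose loads and multiplies can be overlapped even when fewer warps are resident.

    // One element per thread: little instruction-level parallelism within a
    // thread, so latency hiding relies mostly on having many resident warps.
    __global__ void scale_one(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    // Four independent elements per thread: the four loads and multiplies do
    // not depend on each other, so the hardware can overlap their latencies
    // even at lower occupancy. The launch would use a quarter as many threads.
    __global__ void scale_four(float *out, const float *in, int n)
    {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            int idx = base + k;
            if (idx < n)
                out[idx] = in[idx] * 2.0f;
        }
    }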
