Why does a smaller block size (same overall thread count) expose more parallelism?


Question

I'm reading "Professional CUDA C Programming" by Cheng et al., and it gives examples in which a (very simple, single-line) kernel launched with <<<1024, 512>>> performs worse than the same kernel launched with <<<2048, 256>>>. The authors then state (several times) that you might have expected this result, because the second launch has more blocks and therefore exposes more parallelism. I can't figure out why, though. Isn't the amount of parallelism governed by the number of concurrent warps on the SM? What does block size have to do with that? It shouldn't matter which block these warps belong to, the same block or different blocks, so why would using smaller blocks expose more parallelism? (On the contrary, if the block size is too small I'd hit the max-blocks-per-SM limit, resulting in fewer concurrent warps.) The only scenario I can envision is blocks of 1024 threads = 32 warps on Fermi, which has a limit of 48 concurrent warps per SM. This means only 1 block can be resident at a time, so only 32 concurrent warps are possible, which reduces the amount of parallelism. But that's a very specific use case.

UPDATE: Another thing I thought of after posting: a block cannot be evicted from the SM until all of the warps in it have finished. Thus, toward the end of a block's execution, a few last "slowest" warps can hold the entire block on the SM: most of the warps in that block have already finished and sit idle, but a new block cannot be loaded until those few still-executing warps are done. In this case efficiency becomes low. With smaller blocks this still happens, but the number of idle warps relative to executing warps is smaller, hence efficiency is higher. Is this it?

Recommended answer

Yes, this is it. The second paragraph in your question is a good answer.

In more detail, the number of warp schedulers inside every SM is limited (usually 2, though this varies by architecture). Each warp scheduler keeps track of a number of active warps and issues a warp for execution only if that warp is ready to advance in the program. The number of active warps a scheduler can track has a maximum (usually 32; again, the exact figure is architecture-dependent). Because the resources owned by a thread block (such as shared memory) cannot be released to a new thread block until all of its warps finish, a large block size can reduce the number of candidate active warps available to the scheduler whenever a few warps take a long time to finish. This can reduce performance, either because resources sit idle or because the SM can no longer hide the latency of memory accesses. A bigger block size also increases the probability that warps are blocked when synchronizing across the thread block with __syncthreads() or one of its variants, which can lead to a similar effect.
