Is blockIdx correlated to the order of block execution?


Question


Is there any relationship between blockIdx and the order in which thread blocks are executed on the GPU device?

My motivation is that I have a kernel in which multiple blocks will read from the same location in global memory, and it would be nice if these blocks would run concurrently (because L2 cache hits are nice). In deciding how to organize these blocks into a grid, would it be safe to say that blockIdx.x=0 is more likely to run concurrently with blockIdx.x=1 than with blockIdx.x=200? And that I should try to assign consecutive indices to blocks that read from the same location in global memory?

To be clear, I'm not asking about inter-block dependencies (as in this question) and the thread blocks are completely independent from the point of view of program correctness. I'm already using shared memory to broadcast data within a block, and I can't make the blocks any larger.

EDIT: Again, I am well aware that

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series.

and the blocks are fully independent: they can run in any order and produce the same output. I am just asking whether the order in which I arrange the blocks into a grid influences which blocks end up running concurrently, because that affects performance via the L2 cache hit rate.

Answer

I found a writeup in which a CS researcher used micro-benchmarking to reverse engineer the block scheduler on a Fermi device:

http://cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html

I adapted his code to run on my GPU device (GTX 1080, with the Pascal GP104 GPU) and to randomize the runtimes.

Methods

Each block contains only 1 thread, and is launched with enough shared memory that only 2 blocks can be resident per SM. The kernel records its start time (obtained via clock64()) and then runs for a random amount of time (the task, appropriately enough, is generating random numbers using the multiply-with-carry algorithm).
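A minimal sketch of a kernel along these lines is shown here. It is not the code from the writeup or from my adaptation. The multiplier in `mwc_next`, the seeds, the use of the first draw to pick the run length, and the names `record_start`, `sm_id` and `sink` are all assumptions made for illustration. The SM ID is read from the PTX special register `%smid`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One multiply-with-carry step: updates state (x, c) and returns the new x.
// The multiplier is illustrative, not taken from the original benchmark.
__host__ __device__ inline unsigned int mwc_next(unsigned int &x, unsigned int &c) {
    const unsigned long long a = 4294957665ull;
    unsigned long long t = a * x + c;
    x = (unsigned int)t;          // low 32 bits: next value
    c = (unsigned int)(t >> 32);  // high 32 bits: next carry
    return x;
}

// Read the ID of the SM executing this thread (PTX special register %smid).
__device__ unsigned int sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Launched with one thread per block: log when and where the block started,
// then stay busy for a pseudo-random number of iterations.
__global__ void record_start(long long *start, unsigned int *sm,
                             unsigned int *sink, unsigned int max_iters) {
    unsigned int b = blockIdx.x;
    start[b] = clock64();
    sm[b] = sm_id();

    unsigned int x = 12345u + b, c = 6789u;                // arbitrary per-block seed
    unsigned int iters = 1u + mwc_next(x, c) % max_iters;  // assumption: first draw sets run length
    for (unsigned int i = 0; i < iters; ++i) mwc_next(x, c);
    sink[b] = x;  // store the result so the loop is not optimized away
}

int main() {
    const int blocks = 200;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }

    // Dynamic shared memory is requested only to cap how many blocks fit on an SM.
    // The occupancy query reports the resulting limit for the device in use.
    size_t smem = prop.sharedMemPerBlock;
    int per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, record_start, 1, smem);
    printf("resident blocks per SM: %d\n", per_sm);

    long long *start;
    unsigned int *sm, *sink;
    cudaMallocManaged(&start, blocks * sizeof(long long));
    cudaMallocManaged(&sm, blocks * sizeof(unsigned int));
    cudaMallocManaged(&sink, blocks * sizeof(unsigned int));

    record_start<<<blocks, 1, smem>>>(start, sm, sink, 1u << 20);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    for (int b = 0; b < blocks; ++b)
        printf("block %3d  sm %2u  start %lld\n", b, sm[b], start[b]);

    cudaFree(start);
    cudaFree(sm);
    cudaFree(sink);
    return 0;
}
```

Sorting the printed rows by start time gives the order in which blocks began executing. The raw `clock64()` values are only comparable between SMs that share a clock.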

The GTX 1080 consists of 4 Graphics Processing Clusters (GPCs) with 5 streaming multiprocessors (SMs) each. Each GPC has its own clock, so I used the method described in the link to determine which SMs belong to which GPC, then subtracted a fixed offset to convert all of the clock values to the same time zone.
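The offset correction can be sketched in host code as follows. This is an illustration under stated assumptions, not the procedure from the writeup: the SM-to-GPC table `gpc_of_sm` is a hypothetical input that would have to be determined empirically, and the offset for each GPC is assumed to be the earliest start time observed on that GPC, on the reasoning that the first wave of blocks starts at nearly the same real time everywhere.

```cuda
#include <algorithm>
#include <climits>
#include <vector>

// Shift each block's start time into a common timebase.
// sm_of_block[b]: SM that ran block b.
// gpc_of_sm[s]:   GPC containing SM s (hypothetical table, found empirically).
// Assumption: the earliest start seen on each GPC marks that GPC's time zero.
std::vector<long long> align_clocks(const std::vector<long long> &start,
                                    const std::vector<unsigned int> &sm_of_block,
                                    const std::vector<int> &gpc_of_sm,
                                    int num_gpcs) {
    // Earliest recorded start per GPC, used as that GPC's offset.
    std::vector<long long> offset(num_gpcs, LLONG_MAX);
    for (size_t b = 0; b < start.size(); ++b) {
        int g = gpc_of_sm[sm_of_block[b]];
        offset[g] = std::min(offset[g], start[b]);
    }
    // Subtract the offset of the GPC each block ran on.
    std::vector<long long> aligned(start.size());
    for (size_t b = 0; b < start.size(); ++b)
        aligned[b] = start[b] - offset[gpc_of_sm[sm_of_block[b]]];
    return aligned;
}
```

After this step, start times from different GPCs can be plotted on one axis.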

Results

For a 1-D block grid, I found that the blocks were indeed launched in consecutive order:

We have 40 blocks starting immediately (2 blocks per SM * 20 SMs) and the subsequent blocks start when the previous blocks end.
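This pattern is what a simple in-order dispatcher would produce. The host-side model below is only a sketch of that idea, not a description of the real hardware scheduler; `simulate_starts` and its parameters are hypothetical names.

```cuda
#include <functional>
#include <queue>
#include <vector>

// Host-side model: `slots` blocks can be resident at once. Blocks are issued in
// index order, each one as soon as any slot becomes free.
std::vector<long long> simulate_starts(const std::vector<long long> &duration, int slots) {
    // Min-heap of the times at which each slot next becomes free.
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> free_at;
    for (int s = 0; s < slots; ++s) free_at.push(0);  // every slot is free at time 0

    std::vector<long long> start(duration.size());
    for (size_t b = 0; b < duration.size(); ++b) {
        long long t = free_at.top();    // earliest time any slot is available
        free_at.pop();
        start[b] = t;                   // block b starts as soon as that slot frees up
        free_at.push(t + duration[b]);  // the slot is busy until block b finishes
    }
    return start;
}
```

With `slots` set to 40 (2 blocks per SM on 20 SMs), the first 40 blocks in this model start at time 0 and block 40 starts when the shortest of them finishes.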

For 2-D grids, I found the same linear-sequential order, with blockIdx.x being the fast dimension and blockIdx.y the slow dimension:

NB: I made a terrible typo when labeling these plots. All instances of "threadIdx" should be replaced with "blockIdx".
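For the original question, this suggests a way to arrange blocks that read the same data. The sketch below computes a block's position in the observed 2-D launch order and assigns every `group` consecutive positions to the same input tile. The kernel, its parameters and the placeholder arithmetic are hypothetical, and the approach relies on the measured ordering, which the programming model does not guarantee.

```cuda
#include <cuda_runtime.h>

// Position of a block in the observed launch order for a 2-D grid:
// blockIdx.x varies fastest, blockIdx.y slowest.
__host__ __device__ inline unsigned int dispatch_rank(unsigned int bx, unsigned int by,
                                                      unsigned int grid_x) {
    return by * grid_x + bx;
}

// Hypothetical kernel: every `group` consecutive blocks in launch order read the
// same tile of `in`, so blocks sharing a tile are likely to be resident together.
// `out` must be zero-initialized and hold one float per block.
__global__ void shared_tile_kernel(const float *in, float *out,
                                   unsigned int group, unsigned int tile_len) {
    unsigned int rank = dispatch_rank(blockIdx.x, blockIdx.y, gridDim.x);
    unsigned int tile = rank / group;  // which shared input tile this block reads
    unsigned int lane = rank % group;  // this block's position within its group
    const float *src = in + (size_t)tile * tile_len;

    float acc = 0.0f;
    for (unsigned int i = threadIdx.x; i < tile_len; i += blockDim.x)
        acc += src[i] * (float)(lane + 1);  // placeholder work on the shared tile
    atomicAdd(&out[rank], acc);             // combine the per-thread partial sums
}
```

Correctness does not depend on the launch order here. Only the L2 hit rate does, so the kernel still produces the same output if a different GPU schedules blocks differently.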

And for a 3-D block grid:

Conclusions

For a 1-D grid, these results match what Dr. Pai reported in the linked writeup. For 2-D grids, however, I did not find any evidence for a space-filling curve in block execution order, so this may have changed somewhere between Fermi and Pascal.

And of course, the usual caveats with benchmarking apply, and there's no guarantee that this isn't specific to a particular processor model.

Appendix

For reference, here's a plot showing the results for random vs. fixed runtimes:

The fact that we see this trend with randomized runtimes gives me more confidence that this is a real result and not just a quirk of the benchmarking task.
