The motivation for multidimensional CUDA block grids


Problem description


I have basically the same question as the one posed in this discussion. In particular, I want to refer to this final response:

I think there are two different questions mixed together in this thread:

  1. Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.

  2. Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.

This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
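The "index calculation at the top of the kernel" from point 2 can be sketched as follows. This is a minimal illustration, not code from the quoted thread: the names `HD`, `BlockCoord`, `unflatten_block`, `flatten_block` and `blocks_x` are hypothetical, and it assumes blocks are numbered in row-major order. The helpers are plain C++ so the arithmetic can be checked on the host; the `HD` macro makes them callable from device code when compiled with a CUDA compiler.

```cpp
#include <cassert>  // for the host-side sanity check below

// Assumption: when built as CUDA, make the helpers usable in kernels too.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical 2D block coordinate, standing in for blockIdx.x / blockIdx.y.
struct BlockCoord { unsigned x, y; };

// Recover 2D block coordinates from a linear block index,
// assuming row-major numbering with blocks_x blocks per row.
HD inline BlockCoord unflatten_block(unsigned linear, unsigned blocks_x) {
    return BlockCoord{linear % blocks_x, linear / blocks_x};
}

// Inverse mapping: 2D block coordinate back to the linear index.
HD inline unsigned flatten_block(BlockCoord c, unsigned blocks_x) {
    return c.y * blocks_x + c.x;
}

// Host-side check that every linear index round-trips and stays in range.
inline void check_round_trip(unsigned blocks_x, unsigned blocks_y) {
    for (unsigned i = 0; i < blocks_x * blocks_y; ++i) {
        BlockCoord c = unflatten_block(i, blocks_x);
        assert(c.x < blocks_x && c.y < blocks_y);
        assert(flatten_block(c, blocks_x) == i);
    }
}
```

In a kernel launched with a 1D grid of `blocks_x * blocks_y` blocks, one would call `unflatten_block(blockIdx.x, blocks_x)` once at the top and use the result where a 2D launch would read `blockIdx.x` and `blockIdx.y`. The division and modulo are the "few arithmetic instructions" the reply refers to.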

I want to know if this situation still holds today. I want to know the reason why there is a need for a multidimensional "outer" grid.

What I'm trying to understand is whether there is a significant purpose to this (e.g. an actual benefit from spatial locality) or whether it is there for convenience (e.g. in an image processing context, is it there only so that CUDA is aware of the x/y "patch" that a particular block is processing, so it can report it to the CUDA Visual Profiler or something).

A third option is that this is nothing more than a holdover from earlier versions of CUDA, where it was a workaround for hardware indexing limits.

Solution

There is definitely a benefit to using a multidimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers. See the PTX ISA:

PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions. The special registers are:

 %tid
 %ntid
 %laneid
 %warpid
 %nwarpid
 %ctaid
 %nctaid

If some of this data can be used without further processing, you not only save arithmetic instructions, potentially at each indexing step of multidimensional data, but more importantly you also save registers, which are a very scarce resource on any hardware.
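To make "used without further processing" concrete, here is a small sketch of how the components of `%tid`, `%ntid` and `%ctaid` (exposed in CUDA C++ as `threadIdx`, `blockDim` and `blockIdx`) combine into a 2D element coordinate when the launch itself is 2D. The names `HD`, `Vec2u`, `global_coord` and `pixel_offset` are hypothetical, and the row-major image layout is an assumption for illustration. The code is plain C++ so it can be checked on the host, and it also compiles as device code under CUDA.

```cpp
#include <cassert>  // for the host-side sanity check below

// Assumption: when built as CUDA, make the helpers usable in kernels too.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for the .x/.y components of a special register.
struct Vec2u { unsigned x, y; };

// Global 2D coordinate of a thread in a 2D launch:
// each axis uses its own register components, with no division or modulo.
HD inline Vec2u global_coord(Vec2u tid, Vec2u ntid, Vec2u ctaid) {
    return Vec2u{ctaid.x * ntid.x + tid.x, ctaid.y * ntid.y + tid.y};
}

// Row-major offset of that coordinate in an image `width` elements wide.
HD inline unsigned pixel_offset(Vec2u g, unsigned width) {
    return g.y * width + g.x;
}

// Host-side check with a 16x16 block at block coordinate (2, 1).
inline void check_example() {
    Vec2u g = global_coord(Vec2u{3, 2}, Vec2u{16, 16}, Vec2u{2, 1});
    assert(g.x == 35 && g.y == 18);
    assert(pixel_offset(g, 64) == 18 * 64 + 35);
}
```

In a kernel, the three arguments would be filled from `threadIdx`, `blockDim` and `blockIdx`. With a 1D launch, the block's y coordinate would first have to be derived from a linear index, which is the extra arithmetic and temporary storage the answer is pointing at.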
