When should I use CUDA's built-in warpSize, as opposed to my own proper constant?


Question



nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
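
For illustration, a minimal kernel that triggers that complaint might look like the following (a hypothetical example of mine; the exact diagnostic wording varies between compiler versions):

__global__ void repro()
{
    // warpSize is a built-in variable, not a constant expression,
    // so it cannot be used as an array bound:
    __shared__ float buffer[warpSize];   // error: expression must have a constant value
    buffer[threadIdx.x] = 0.0f;
}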

So, at least for that purpose you are motivated to have something like (edit):

enum : unsigned int { warp_size = 32 };

somewhere in your headers. But now - which should I prefer, and when: warpSize or warp_size?

Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.

Solution

Contrary to talonmies's answer, I find the warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).

The day we get a different warp size (e.g. 64), many things will change:

  • warpSize itself will obviously have to be adjusted
  • Many warp-level intrinsics will need their signatures adjusted, or new versions produced, e.g. int __ballot; and while int does not need to be 32-bit, it most commonly is!
  • Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone write:

    for (int i = 0; i < log2(warpSize); ++i) ...
    

    that would be overly complex in something that is usually a time-critical piece of code.

  • The computation of warpIdx and laneIdx from threadIdx would need to be adjusted. Currently, the most typical code I see for it is:

    warpIdx = threadIdx.x/32;
    laneIdx = threadIdx.x%32;
    

    which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize, this suddenly becomes quite an expensive operation! (See the sketch after this list.)
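
To make the last two points concrete, here is a minimal sketch of the hard-coded patterns in question (my own illustration rather than code from the answer, written against the current __shfl_down_sync intrinsic):

__device__ float warpReduceSum(float val)
{
    // Fixed iteration count: offsets 16, 8, 4, 2, 1 -- i.e. log2(32) steps,
    // hard-coded for a warp size of 32.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reducePerWarp(const float *in, float *out)
{
    int warpIdx = threadIdx.x / 32;   // compiles to a right shift by 5
    int laneIdx = threadIdx.x % 32;   // compiles to a mask with 0x1f

    float sum = warpReduceSum(in[blockIdx.x * blockDim.x + threadIdx.x]);
    if (laneIdx == 0)
        out[blockIdx.x * (blockDim.x / 32) + warpIdx] = sum;
}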

At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time constant. Also, if the amount of shared memory depends on warpSize, this forces you to use dynamically allocated shared memory (as per talonmies's answer). However, the syntax for that is inconvenient, especially when you have several arrays -- it forces you to do the pointer arithmetic yourself and manually compute the total memory usage.
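
To illustrate that inconvenience, here is a sketch of mine (the kernel and variable names are hypothetical). With a compile-time warp_size, several per-warp arrays can be declared statically; with warpSize, they must share one dynamic allocation that you carve up by hand, and whose total size the host must also compute by hand at launch time:

__global__ void staticShmem()
{
    __shared__ float sums[warp_size];      // fine: warp_size is a compile-time constant
    __shared__ int   flags[warp_size];
    sums[threadIdx.x % warp_size]  = 0.0f;
    flags[threadIdx.x % warp_size] = 0;
}

extern __shared__ unsigned char pool[];    // one dynamically sized allocation for everything

__global__ void dynamicShmem()
{
    // Carve the pool up manually; the launch configuration must pass
    // warpSize * (sizeof(float) + sizeof(int)) bytes of dynamic shared memory.
    float *sums  = reinterpret_cast<float *>(pool);
    int   *flags = reinterpret_cast<int *>(sums + warpSize);
    sums[threadIdx.x % warpSize]  = 0.0f;
    flags[threadIdx.x % warpSize] = 0;
}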

Passing warp_size as a template parameter is a partial solution, but it adds a layer of syntactic complexity at every function call:

deviceFunction<warp_size>(params)

This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
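
For reference, the template approach being referred to would look something like this (my own sketch; laneOf is a hypothetical helper):

template <unsigned int WarpSize>
__device__ unsigned int laneOf(unsigned int tid)
{
    // WarpSize is a genuine compile-time constant inside the function,
    // so this still compiles down to a mask when WarpSize is 32.
    return tid % WarpSize;
}

// ...but every call site has to spell out the parameter:
// unsigned int lane = laneOf<warp_size>(threadIdx.x);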


My suggestion would be to have a single header that controls all the model-specific constants, e.g.

#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32; 
#endif

Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
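
For instance, once such a header is in place (call it warp_constants.h; the header name and the kernel below are hypothetical examples of mine), the constant is used exactly like a hard-coded 32:

#include "warp_constants.h"

__global__ void perWarpCounters(int *out)
{
    __shared__ int counts[warp_size];      // static shared memory, sized by the constant
    int lane = threadIdx.x % warp_size;    // still compiles down to a mask

    if (threadIdx.x < warp_size)
        counts[threadIdx.x] = 0;
    __syncthreads();

    atomicAdd(&counts[lane], 1);
    __syncthreads();

    if (threadIdx.x < warp_size)
        out[blockIdx.x * warp_size + threadIdx.x] = counts[threadIdx.x];
}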
