CUDA registers per thread

Views: 26 | Published: 2022/1/10 15:59:19 | Tags: cuda, local-storage, profiler


Question

As I understand it, devices of compute capability 2.x have a limit of 63 registers per thread. Does anyone know what the per-thread register limit is for devices of compute capability 1.3?

I have a big kernel which I'm testing on a GTX 260. I'm pretty sure I'm using a lot of registers, since the kernel is very complex and needs a lot of local variables. According to the CUDA profiler, my register usage is 63 (Static Smem is 68, although I'm not sure what that means, and Dynamic Smem is 0). I'm fairly sure I have more than 63 local variables, so I figure the compiler is either reusing registers or spilling them into local memory.

Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than 2.x devices. My guess was that the compiler chose the limit of 63 because I'm using blocks of 256 threads: 256*63 is 16128, while 256*64 is 16384, which is the total number of registers on an SM of this device. So I assumed that lowering the number of threads per block would let me increase the number of registers in use, and I ran the kernel with blocks of 192 threads. But the profiler again shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, both well inside the SM's 16384 limit.

So, any idea why the compiler is still limiting itself to 63 registers? Could it all be down to register reuse, or is it still spilling registers?
