Cuda registers per thread


Question

If I understand correctly, devices of compute capability 2.x have a limit of 63 registers per thread. What is the per-thread register limit for devices of compute capability 1.3?

I have a big kernel that I'm testing on a GTX260. I'm pretty sure I'm using a lot of registers, since the kernel is very complex and I need a lot of local variables. According to the CUDA profiler, my register usage is 63 (static Smem is 68, although I'm not sure what that means, and dynamic Smem is 0). I'm pretty sure I have more than 63 local variables, so I figured the compiler is either reusing registers or spilling them into local memory.

Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than the 2.x devices. My guess was that the compiler was choosing the 63 limit because I'm using blocks of 256 threads: 256*63 is 16128, while 256*64 is 16384, which is the total number of registers on an SM of this device. So I guessed that if I lowered the number of threads per block, I could increase the number of registers in use, and I ran the kernel with blocks of 192 threads. But the profiler again shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, which is well inside the 16384 limit of the SM.
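
As a quick check of the arithmetic above, here is a minimal host-only sketch (plain C++11, which also builds inside a .cu file). The helper names are made up for illustration, and the 16384 figure is the per-SM register count quoted in the question. Real hardware also rounds register allocation to a granularity that this sketch ignores.

```cpp
// Host-only arithmetic; no GPU is needed to compile this.
// Assumption: 16384 registers per SM, the figure quoted in the question.
constexpr int kRegsPerSM = 16384;

// Registers one block needs if every thread uses `regs_per_thread`.
// (Hypothetical helper; ignores allocation granularity.)
constexpr int regs_per_block(int regs_per_thread, int threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// Largest per-thread register count for which a single block of the
// given size still fits in the SM's register file.
constexpr int max_regs_for_one_block(int threads_per_block) {
    return kRegsPerSM / threads_per_block;
}

// The figures from the question, checked at compile time.
static_assert(regs_per_block(63, 256) == 16128, "63 regs x 256 threads");
static_assert(regs_per_block(64, 256) == 16384, "64 regs x 256 threads");
static_assert(regs_per_block(63, 192) == 12096, "63 regs x 192 threads");
static_assert(regs_per_block(64, 192) == 12288, "64 regs x 192 threads");
```

Under these assumptions, `max_regs_for_one_block(256)` is 64 and `max_regs_for_one_block(192)` is 85, so the block size alone would not seem to explain a cap of 63.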

So, any idea why the compiler is still limiting itself to 63 registers? Could it all be down to register reuse, or is it still spilling registers?

Solution

The maximum number of registers per thread is documented here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-technical-specifications.xml

It is 63 for compute capability (cc) 2.x and 3.0, 128 for cc 1.x, and 255 for cc 3.5.

The compiler may have decided that 63 registers are enough and that it has no use for additional ones. Registers can be reused, so having a lot of local variables doesn't necessarily mean that the register count per thread has to be high.

My suggestion would be to use the nvcc -maxrregcount option to specify various limits, and then use the -Xptxas -v option to have the compiler tell you how many registers it is using when it compiles the PTX.
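
In addition to the compiler report, the same numbers can be read at run time. The following is a minimal sketch, not the asker's code: `big_kernel` is a hypothetical stand-in for the real kernel and `register_report.cu` is an assumed file name. It uses `cudaFuncGetAttributes` from the CUDA runtime API, which fills in the register count and the per-thread local memory size for a compiled kernel.

```cuda
// Possible builds (file name and the limit of 32 are assumptions):
//   nvcc -Xptxas -v register_report.cu -o register_report
//   nvcc -Xptxas -v -maxrregcount=32 register_report.cu -o register_report
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder for the large kernel from the question.
__global__ void big_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    // Query the attributes the compiler recorded for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, big_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("registers per thread : %d\n", attr.numRegs);
    std::printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    std::printf("static shared (bytes): %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

A nonzero local memory size can come from register spills, but also from local arrays the compiler could not keep in registers, so it is a hint rather than proof of spilling. Rebuilding with different -maxrregcount values and comparing the reported numbers should show where spilling starts.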
