What kind of variables consume registers in CUDA?

Question

__global__ void add( int *c, const int* a, const int* b )
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}

In the above example, I guess x, y, and offset are stored in registers, while:

  • nvcc -Xptxas -v reports 4 registers and 24+16 bytes of smem

  • the profiler shows 4 registers

  • the head of the PTX file is:

.reg .u16 %rh<4>;
.reg .u32 %r<9>;    
.reg .u64 %rd<10>;  
.loc    15  21  0   

$LDWbegin__Z3addPiPKiS1_:   
.loc    15  26  0  

Can anyone clarify how registers are used? On Fermi, the maximum number of registers per thread is 63. In my program I want to test the case where a kernel consumes too many registers, so that variables have to be spilled to local memory automatically and performance drops. At that point I could split the kernel into two so that each thread has enough registers, assuming the SM has sufficient resources for concurrent kernels.

I am not sure whether my understanding is correct.

Recommended answer

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each register in PTX is assigned only once. A piece of PTX with hundreds of registers can compile into a kernel that uses only a few registers.

Register assignment is done by ptxas as a completely standalone compilation pass (either statically, or just-in-time by the driver, or both). It can perform a lot of code reordering and optimisation on the input PTX to improve throughput and conserve registers, so there is little or no relationship between the variables in the original C code or the registers in the PTX and the final register count of the assembled kernel.
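
Since the PTX register declarations say nothing about the final count, one way to see what ptxas actually allocated is to ask the CUDA runtime. The following is a minimal sketch, assuming the kernel from the question is compiled in the same file; it uses cudaFuncGetAttributes to read the per-thread register count and local memory size of the compiled kernel. The printed values depend on the compiler version and target architecture, so none are shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the question.
__global__ void add(int *c, const int *a, const int *b)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}

int main()
{
    // Query the attributes of the compiled kernel (machine code, not PTX).
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, add);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // numRegs: registers used by each thread of this kernel.
    // localSizeBytes: local memory per thread; a nonzero value can
    // indicate register spills (or local arrays).
    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

This should report the same register figure as nvcc -Xptxas -v for the architecture the code actually runs on, which is the number that matters for occupancy.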

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You can use __launch_bounds__ to give the compiler heuristic hints that can influence register allocation, and the compiler/assembler accepts the -maxrregcount argument (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference with older versions of the nvopen64-based compiler and could influence local memory spill behaviour. But you can't arbitrarily control or steer register allocation from the original C code or from PTX assembly language code.
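
As a rough illustration of the two mechanisms mentioned above, the sketch below annotates a kernel with __launch_bounds__ and lists, in comments, how -maxrregcount might be passed on the command line. The kernel name scale, the file name kernel.cu, and the limits 256, 4, and 32 are hypothetical values chosen for the example, not recommendations; whether they change the register count at all depends on the kernel and the target architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical limits, for illustration only.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM 4

// __launch_bounds__ tells the compiler the largest block size this kernel
// will be launched with and, optionally, how many blocks should be resident
// per SM. The compiler may use this to limit registers per thread.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1024;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d, 0, n * sizeof(float));

    // The block size must not exceed MAX_THREADS_PER_BLOCK,
    // otherwise the launch fails.
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    scale<<<blocks, MAX_THREADS_PER_BLOCK>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}

// Possible build commands (kernel.cu is a placeholder file name):
//   nvcc -Xptxas -v -c kernel.cu                    # report registers and spills
//   nvcc -maxrregcount=32 -Xptxas -v -c kernel.cu   # cap registers per thread
```

Comparing the -Xptxas -v output with and without the cap is one way to observe the spilling scenario described in the question, since ptxas reports spill stores and loads when the limit forces values into local memory.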
