What kind of variables consume registers in CUDA?


Question

__global__ void add( int *c, const int* a, const int* b )
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}



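In the example above, I guess that x, y and offset are stored in registers, while:

  • nvcc -Xptxas -v reports 4 registers and 24+16 bytes of smem
  • the profiler shows 4 registers
  • the head of the PTX file reads: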

.reg .u16 %rh<4>;
.reg .u32 %r<9>;    
.reg .u64 %rd<10>;  
.loc    15  21  0   

$LDWbegin__Z3addPiPKiS1_:   
.loc    15  26  0  


Can anyone clarify how registers are used? On Fermi, each thread can use at most 63 registers. In my program I want to test the case where a kernel consumes too many registers, so that variables are automatically spilled to local memory and performance drops. At that point I could split the kernel into two so that each thread has enough registers, assuming the SM resources are sufficient for concurrent kernels.

I am not sure whether my understanding is correct.

Recommended answer

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each virtual register in PTX is assigned only once. A piece of PTX with hundreds of registers can compile into a kernel that uses only a few.

Register assignment is done by ptxas as a completely standalone compilation pass (either statically, or just-in-time by the driver, or both). It can perform a lot of code reordering and optimisation on the input PTX to improve throughput and conserve registers, so there is little or no relationship between the variables in the original C code, or the registers in the PTX, and the final register count of the assembled kernel.
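Because the PTX register declarations say nothing about the final count, one way to see what ptxas actually assigned is to ask the runtime. The following is a minimal sketch, assuming the kernel from the question is compiled in the same file; it uses `cudaFuncGetAttributes` and prints the values it reports for the target architecture, which will differ between GPUs and toolkit versions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the question, unchanged.
__global__ void add(int *c, const int *a, const int *b)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}

int main()
{
    // Query the attributes of the assembled kernel, not of the PTX.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, add);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread           : %d\n", attr.numRegs);
    // Non-zero local memory can indicate register spilling (or local arrays).
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    printf("static shared memory (bytes)   : %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

The register count printed here should agree with what `nvcc -Xptxas -v` reports at build time for the same architecture.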

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You have __launch_bounds__ to provide heuristic hints to the compiler, which can influence register allocation, and the compiler/assembler takes the -maxrregcount argument (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference with older versions of the nvopen64 based compiler and could influence the local memory spill behaviour. But you can't arbitrarily control or steer register allocation from the original C code or from PTX assembly language code.
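As an illustration of the first hint, here is a minimal sketch of a kernel annotated with `__launch_bounds__`. The kernel `scale` and the two bound values are hypothetical and chosen only for the example; they are not tuned for any particular device. The same file could alternatively be built with something like `nvcc -maxrregcount=32 -Xptxas -v` (the value 32 is likewise an assumption) to cap registers for the whole compilation unit and see whether spilling is reported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative values only; they are assumptions, not tuned numbers.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM     4

// Hypothetical kernel. __launch_bounds__ tells the compiler the largest block
// size the kernel will be launched with and the desired number of resident
// blocks per SM, from which it can derive a per-thread register budget.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch must not exceed the block size promised in __launch_bounds__.
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    scale<<<blocks, MAX_THREADS_PER_BLOCK>>>(d_data, 2.0f, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Both mechanisms are hints or limits applied by the toolchain; neither lets you pin a particular variable to a register.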
