CUDA PTX register declaration and usage


Question

I am trying to reduce the number of registers used by my kernel, so I decided to try inline PTX.

This kernel:

#define Feedback(a, b, c, d, e) d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c)

__global__ void Test(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback( a, b, c, d, e );  
    res[1] = Feedback( b, c, d, e, f );
    res[2] = Feedback( c, d, e, f, j );  
    res[3] = Feedback( d, e, f, j, h );
}  

uses 14 registers. I think that is more than it needs, so I wrote an inline PTX version:

__global__ void Feedback_ASM(unsigned long a, unsigned long b, unsigned long c, unsigned long d, unsigned long e, unsigned long f, unsigned long j, unsigned long h, unsigned long* res)
{
asm(".reg .u32 %r<10>;\n");

// 1
asm("ld.param.u32   %r1, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_a];\n"
    "ld.param.u32   %r2, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_b];\n"
    "ld.param.u32   %r3, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_c];\n"
    "ld.param.u32   %r4, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_d];\n"
    "ld.param.u32   %r5, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_e];\n");

asm("and.b32 %r7, %r1, %r3;\n"
    "xor.b32 %r8, %r7, %r4;\n"
    "xor.b32 %r7, %r8, %r5;\n"
    "and.b32 %r8, %r1, %r5;\n"
    "xor.b32 %r9, %r7, %r8;\n"
    .............................
    "xor.b32 %r8, %r7, %r9;\n"
    "and.b32 %r6, %r1, %r2;\n"
    "and.b32 %r7, %r6, %r3;\n"
    "xor.b32 %r9, %r7, %r8;\n");

asm("ld.param.u32   %r8, [__cudaparm__Z7Feedback_ASMmmmmmmmmPm_res];\n"
    "st.global.u32  [%r8+0], %r9;");     
// 2
...
// 3
...
// 4
...
}     

But this kernel uses 14 registers too! I am a little confused. I declared only 10 registers, and there are no other variables in the PTX file. How can I resolve this?

Recommended answer

PTX is an intermediate code. PTX "registers" are virtual registers and don't necessarily reflect actual device register usage: ptxas performs the real register allocation when it compiles the PTX to machine code, so declaring 10 PTX registers does not limit the kernel to 10 hardware registers.
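As a minimal sketch of what "virtual" means in practice, inline PTX can use operand constraints instead of hard-coded register names, leaving the mapping to hardware registers to the compiler. The helper `and_xor`, the kernel `and_xor_kernel`, and the host wrapper `run_and_xor` below are hypothetical names introduced for illustration. They compute only the first two operations of the question's sequence, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the question): computes (a & c) ^ d.
// The "r" constraints let the compiler pick the registers for the operands;
// "t" is a virtual scratch register scoped to the braces.
__device__ unsigned int and_xor(unsigned int a, unsigned int c, unsigned int d)
{
    unsigned int out;
    asm("{\n\t"
        ".reg .b32 t;\n\t"
        "and.b32 t, %1, %2;\n\t"   // t   = a & c
        "xor.b32 %0, t, %3;\n\t"   // out = t ^ d
        "}"
        : "=r"(out)
        : "r"(a), "r"(c), "r"(d));
    return out;
}

// Hypothetical kernel that stores the helper's result.
__global__ void and_xor_kernel(unsigned int a, unsigned int c, unsigned int d,
                               unsigned int* res)
{
    res[0] = and_xor(a, c, d);
}

// Hypothetical host wrapper: runs one thread and copies the result back.
// Error checking is omitted to keep the sketch short.
unsigned int run_and_xor(unsigned int a, unsigned int c, unsigned int d)
{
    unsigned int* d_res = nullptr;
    unsigned int h_res = 0;
    cudaMalloc(&d_res, sizeof(unsigned int));
    and_xor_kernel<<<1, 1>>>(a, c, d, d_res);
    cudaMemcpy(&h_res, d_res, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_res);
    return h_res;
}
```

Declaring one scratch register or ten in the `.reg` line changes only the PTX text. The hardware register count is whatever ptxas reports after allocation.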

To get an idea of actual device register usage, compile using the ptxas verbose option:

nvcc -Xptxas -v ...

or use one of the profilers. You can also inspect the machine code directly using:

cuobjdump -sass myexe

(where myexe is replaced with the name of your executable).
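The register count that ptxas settles on can also be read at run time. The sketch below assumes the CUDA runtime API is available and uses `cudaFuncGetAttributes`, whose `cudaFuncAttributes::numRegs` field holds the registers used per thread. The function name `registers_per_thread` is hypothetical, and the kernel is the one from the question with outer parentheses added to the macro.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Macro from the question; the outer parentheses are an addition.
#define Feedback(a, b, c, d, e) (d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c))

// Kernel from the question.
__global__ void Test(unsigned long a, unsigned long b, unsigned long c,
                     unsigned long d, unsigned long e, unsigned long f,
                     unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback(a, b, c, d, e);
    res[1] = Feedback(b, c, d, e, f);
    res[2] = Feedback(c, d, e, f, j);
    res[3] = Feedback(d, e, f, j, h);
}

// Hypothetical helper: returns the registers per thread that the compiled
// Test kernel uses on the current device, or -1 if the query fails.
int registers_per_thread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, Test);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}
```

The value returned should agree with what `nvcc -Xptxas -v` prints for the same kernel and target architecture.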

To control register usage, you can use the nvcc compile option:

nvcc -maxrregcount 10 ...

(where 10 is replaced with the maximum number of registers per thread that you want all kernels in your code to be limited to), or you can use the launch bounds directive in your code, which can control register usage on a kernel-by-kernel basis.
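A minimal sketch of the launch bounds approach, applied to the kernel from the question, might look like the following. The values 256 and 4 are assumptions chosen for illustration, and `TestBounded` and `run_test_bounded` are hypothetical names. Note that `__launch_bounds__` does not take a register count: it states the maximum threads per block and, optionally, the minimum resident blocks per multiprocessor, and the compiler derives a register limit from those.

```cuda
#include <cuda_runtime.h>

// Macro from the question; the outer parentheses are an addition.
#define Feedback(a, b, c, d, e) (d^e^(a&c)^(a&e)^(b&c)^(b&e)^(c&d)^(d&e)^(a&d&e)^(a&c&e)^(a&b&d)^(a&b&c))

#define MAX_THREADS_PER_BLOCK 256  // assumption: largest block size ever launched
#define MIN_BLOCKS_PER_SM     4    // assumption: desired resident blocks per SM

// Same body as the question's kernel, with per-kernel launch bounds.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
TestBounded(unsigned long a, unsigned long b, unsigned long c,
            unsigned long d, unsigned long e, unsigned long f,
            unsigned long j, unsigned long h, unsigned long* res)
{
    res[0] = Feedback(a, b, c, d, e);
    res[1] = Feedback(b, c, d, e, f);
    res[2] = Feedback(c, d, e, f, j);
    res[3] = Feedback(d, e, f, j, h);
}

// Hypothetical host wrapper: launches one thread and copies the four results
// into out[0..3]. Error checking is omitted to keep the sketch short.
void run_test_bounded(unsigned long a, unsigned long b, unsigned long c,
                      unsigned long d, unsigned long e, unsigned long f,
                      unsigned long j, unsigned long h, unsigned long out[4])
{
    unsigned long* d_res = nullptr;
    cudaMalloc(&d_res, 4 * sizeof(unsigned long));
    // The block size must not exceed MAX_THREADS_PER_BLOCK.
    TestBounded<<<1, 1>>>(a, b, c, d, e, f, j, h, d_res);
    cudaMemcpy(out, d_res, 4 * sizeof(unsigned long), cudaMemcpyDeviceToHost);
    cudaFree(d_res);
}
```

Launching the kernel with more threads per block than the declared maximum makes the launch fail, so the first bound should match the largest block size the code actually uses.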
