Confusion with CUDA PTX code and register memory


Problem description

:) While I was trying to manage my kernel resources I decided to look into PTX but there are a couple of things that I do not understand. Here is a very simple kernel I wrote:

__global__
void foo(float* out, float* in, uint32_t n)
{
    uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t one = 5;
    out[idx] = in[idx]+one;
}

Then I compiled it using: nvcc --ptxas-options=-v -keep main.cu and I got this output on the console:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z3fooPfS_j' for 'sm_10'
ptxas info    : Used 2 registers, 36 bytes smem

And the resulting ptx is the following:

    .entry _Z3fooPfS_j (
            .param .u64 __cudaparm__Z3fooPfS_j_out,
            .param .u64 __cudaparm__Z3fooPfS_j_in,
            .param .u32 __cudaparm__Z3fooPfS_j_n)
    {
    .reg .u16 %rh<4>;
    .reg .u32 %r<5>;
    .reg .u64 %rd<8>;
    .reg .f32 %f<5>;
    .loc    15  17  0
$LDWbegin__Z3fooPfS_j:
    .loc    15  21  0
    mov.u16     %rh1, %ctaid.x;
    mov.u16     %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.u32.u16     %r2, %tid.x;
    add.u32     %r3, %r2, %r1;
    cvt.u64.u32     %rd1, %r3;
    mul.wide.u32    %rd2, %r3, 4;
    ld.param.u64    %rd3, [__cudaparm__Z3fooPfS_j_in];
    add.u64     %rd4, %rd3, %rd2;
    ld.global.f32   %f1, [%rd4+0];
    mov.f32     %f2, 0f40a00000;        // 5
    add.f32     %f3, %f1, %f2;
    ld.param.u64    %rd5, [__cudaparm__Z3fooPfS_j_out];
    add.u64     %rd6, %rd5, %rd2;
    st.global.f32   [%rd6+0], %f3;
    .loc    15  22  0
    exit;
$LDWend__Z3fooPfS_j:
    } // _Z3fooPfS_j

Now there are some things that I don't understand:

  • According to the PTX assembly, 4+5+8+5 = 22 registers are used. So why does it say that 2 registers were used during compilation?
  • Looking at the assembly I realised that the data type of threadId, blockId etc. is u16. Is this defined in the CUDA specification? Or may this vary between different versions of the CUDA driver?
  • Can someone explain this line to me: mul.wide.u16 %r1, %rh1, %rh2;? %r1 is u32, so why is mul.wide used instead of mul.u32?
  • How are the names of the registers chosen? In my case I understand the %r part, but I don't understand the h, (none), d part. Is it chosen based on the data type length? I.e.: h for 16-bit, nothing for 32-bit, d for 64-bit?
  • If I replace the last 2 lines of my kernel with this: out[idx] = in[idx];, then when I compile the program it says that 3 registers are used! How is it possible that more registers are used now?

Please ignore the fact that my test kernel does not check if the array index is out of bounds.

Thank you very much.

Solution

PTX is an intermediate language that is designed to be portable across multiple GPU architectures. It gets compiled by the compiler component PTXAS into final machine code, also referred to as SASS, for a particular architecture. The nvcc option -Xptxas -v causes PTXAS to report various statistics about the generated machine code, including the number of physical registers used in the machine code. You can inspect the machine code by disassembling it with cuobjdump --dump-sass.
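
For example, the complete round trip might look like this (file and program names assumed for illustration):

    nvcc -Xptxas -v -keep main.cu -o main     # report register/smem usage per kernel
    cuobjdump --dump-sass main                # disassemble the embedded machine code

The .ptx file kept by -keep shows the virtual-register form discussed below, while the SASS dump shows the physical registers that the "Used 2 registers" message refers to.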

So the number of registers one sees used in PTX code has no significance, since these are virtual registers. The CUDA compiler generates PTX code in what is known as SSA form (static single assignment, see http://en.wikipedia.org/wiki/Static_single_assignment_form). This basically means that each new result written is assigned a new register.
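
A minimal illustration (the PTX in the comments is a hypothetical sketch, not actual compiler output): even when the source reuses a variable, each result is written to a fresh virtual register:

    float x = a * b;   // mul.f32 %f3, %f1, %f2;  -> the result gets a new register %f3
    x = x + c;         // add.f32 %f5, %f3, %f4;  -> x is not reused; a fresh %f5 is assigned

PTXAS later performs real register allocation and maps these virtual registers onto the small set of physical hardware registers, which is how 22 virtual registers can collapse to 2 physical ones.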

The instruction mul.wide is described in the PTX specification, the current version of which (3.1) you can find here: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html . In your example code, the suffix .u16 means that it multiplies two unsigned 16-bit quantities and returns an unsigned 32-bit result, i.e. it computes the full, double-width product of the source operands.
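
In C terms the semantics are those of widening both operands before the multiply, so no bits of the product are lost. A sketch (the function name is mine, not from the toolkit):

    #include <stdint.h>

    // Analogous to mul.wide.u16: full 16x16 -> 32-bit product, cannot overflow.
    uint32_t mul_wide_u16(uint16_t a, uint16_t b)
    {
        return (uint32_t)a * (uint32_t)b;
    }

This is why the kernel's blockIdx.x * blockDim.x becomes a single mul.wide.u16 here: when compiling for sm_10 both operands fit in 16 bits, and the instruction still yields the full 32-bit index.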

Virtual registers in PTX are typed, but their names can be chosen freely, independent of type. The CUDA compiler appears to follow certain conventions that are (to my knowledge) not documented since they are internal implementation artifacts. Looking at a bunch of PTX code it is clear that the register names currently generated encode type information, this may be done for ease of debugging: p<num> is used for predicates, r<num> for 32-bit integers, rd<num> for 64-bit integers, f<num> for 32-bit floats, and fd<num> for 64-bit doubles. You can easily see this for yourself by looking at the .reg directives in the PTX code that create these virtual registers.
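
As a quick illustration that the names are free-form (a hand-written sketch, not compiler output), PTX happily accepts a float register whose name encodes nothing about its type:

    .reg .f32 %anything;               // arbitrary name for a 32-bit float register
    mov.f32   %anything, 0f3f800000;   // 1.0f, same 0f... literal syntax as in the listing above

So the f/rd/fd pattern seen in generated code is a readability convention of the compiler, not a requirement of the PTX language.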
