Forcing CUDA to use register for a variable


Problem description

I have many unused registers in my kernel. I'd like to tell CUDA to use a few registers to hold some data, rather than doing a global data read every time I need it. (I'm not able to use shared memory.)

    __global__ void simple(float *gData) {
        float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }

Compiling with `nvcc -arch sm_20 --ptxas-options=-v simple.cu`, I get:

    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]

    __global__ void simple(float *gData) {
        register float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }

The `register` declaration does nothing:

    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]

    __global__ void simple(float *gData) {
        volatile float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }

The `volatile` declaration creates stack storage:

    4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 21 registers, 40 bytes cmem[0]



1) Is there an easy way to tell the compiler to use register space for a variable?

2) Where is the 'stack frame': registers, global memory, local memory, ...? What is a stack frame? (Since when does the GPU have a stack? A virtual stack?)

3) The simple.ptx file is basically empty: (`nvcc -arch sm_20 -ptx simple.cu`)

    .loc 2 14 2
    ret;

Any idea where I can find the real machine/compiled code?

Solution


  • Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable.

  • Scalar variables are automatically stored in registers by the compiler.

  • Statically indexed (i.e., where the index can be determined at compile time), small arrays (say, less than 16 floats) may be stored in registers by the compiler.
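A minimal sketch of the static-indexing case (the kernel name and array size are illustrative, not from the question): when a loop over a small array is fully unrolled, every index becomes a compile-time constant, and the compiler is then free to promote the array into registers.

```cuda
// Hypothetical sketch: a small, statically indexed per-thread array.
// With #pragma unroll, each rData[i] access has a constant index, so
// ptxas may keep the whole array in registers instead of local memory.
__global__ void smallArray(float *gData)
{
    float rData[8];                   // small array, statically indexed below

    #pragma unroll
    for (int i = 0; i < 8; i++)       // trip count known at compile time
        rData[i] = gData[threadIdx.x * 8 + i];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++)
        sum += rData[i];

    gData[threadIdx.x] = sum;         // use the result so it is not dead code
}
```

Whether promotion actually happens can be checked with `--ptxas-options=-v`: register-resident arrays add to the register count but leave the stack frame at 0 bytes.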



SM 2.0 GPUs (Fermi) only support up to 63 registers per thread. If this is exceeded, register values will be spilled/filled from local (off-chip) memory, backed by the cache hierarchy. SM 3.5 GPUs expand this to up to 255 registers per thread.



In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces latency-hiding ability in the kernel. GPUs thrive on parallelism and do so by covering memory latency with work from other threads.



Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible, so you maximize coalescing (i.e., minimize memory transactions).
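A sketch contrasting the two access patterns (kernel names are illustrative): in the coalesced version, consecutive threads read consecutive addresses, so a warp's 32 loads combine into a minimal number of memory transactions; in the strided version, each thread walks its own 1024-element chunk, so neighboring threads touch addresses 4 KB apart and each load becomes its own transaction.

```cuda
// Coalesced: thread i touches element i, so a warp reads one
// contiguous 128-byte segment per load instruction.
__global__ void coalesced(const float *gData, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gData[i] * 2.0f;
}

// Strided: each thread owns a private 1024-element chunk, so
// consecutive threads access addresses 1024 floats apart and
// every load is serviced by a separate memory transaction.
__global__ void strided(const float *gData, float *out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 1024;
    for (int i = 0; i < 1024; i++)
        if (base + i < n)
            out[base + i] = gData[base + i] * 2.0f;
}
```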



The example you give may be a case for shared memory if:

  1. Many threads in the block use the same data, or

  2. The per-thread array size is small enough to allocate enough space for all threads in multiple thread blocks (1024 floats per thread is far too much).


  • As njuffa mentioned, the reason your kernel only uses 2 registers is that you don't do anything useful with the data in the kernel, so the dead code was all eliminated by the compiler.
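To see realistic numbers from `--ptxas-options=-v`, the kernel needs an observable side effect. A sketch of one way to keep the loads live (the reduction is illustrative; any write back to global memory derived from `rData` works):

```cuda
// Variant of the question's kernel with a result written back, so the
// compiler cannot eliminate the rData loads as dead code.
__global__ void simpleLive(float *gData)
{
    float rData[1024];
    for (int i = 0; i < 1024; i++)
        rData[i] = gData[i];

    float sum = 0.0f;
    for (int i = 0; i < 1024; i++)
        sum += rData[i];

    gData[threadIdx.x] = sum;   // observable side effect keeps the work live
}
```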

