Forcing CUDA to use register for a variable

Question

I have many unused registers in my kernel. I'd like to tell CUDA to use a few registers to hold some data, rather than doing a global data read every time I need it. (I'm not able to use shared mem.)

__global__ void simple(float *gData) {
    float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

compile w/: nvcc -arch sm_20 --ptxas-options=-v simple.cu, and I get
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    register float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The register declaration does nothing:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    volatile float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The volatile declaration creates stack storage:
4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 21 registers, 40 bytes cmem[0]

1) Is there an easy way to tell the compiler to use register space for a variable?
2) Where is 'stack frame': register, global mem, local mem,...? What is a stack frame? (Since when does the GPU have a stack? A virtual stack?)
3) The simple.ptx file is basically empty: (nvcc -arch sm_20 -ptx simple.cu)

.loc 2 14 2
ret;

Any idea where I can find the real machine/compiled code?

Answer

• Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable.
• Scalar variables are automatically stored in registers by the compiler.
• Statically indexed small arrays (i.e. where the index can be determined at compile time; say, fewer than 16 floats) may be stored in registers by the compiler (see the sketch after this list).
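
To illustrate the third point, here is a minimal sketch (the kernel name, array size, and indexing are hypothetical, not taken from the question) of a small per-thread array whose indices all become compile-time constants once the loop is unrolled, so the compiler may keep it entirely in registers:

__global__ void smallArray(const float *gData, float *out) {
    float rData[4];                          // small, fixed-size per-thread array
    #pragma unroll                           // trip count is known, so every index is a constant
    for (int i = 0; i < 4; i++) {
        rData[i] = gData[threadIdx.x * 4 + i];
    }
    // Each rData[i] is an independent scalar after unrolling and may live in a register.
    out[threadIdx.x] = rData[0] + rData[1] + rData[2] + rData[3];
}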

SM 2.0 GPUs (Fermi) only support up to 63 registers per thread. If this is exceeded, register values will be spilled/filled from local (off-chip) memory, supported by the cache hierarchy. SM 3.5 GPUs expand this to up to 255 registers per thread.
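
When register pressure does become a problem, it can be bounded per kernel. A minimal sketch, with arbitrary example bounds (256 threads per block, 4 resident blocks per SM) that are illustrations, not recommendations:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the compiler
// the intended launch configuration so it can budget registers accordingly; the nvcc
// flag -maxrregcount=N similarly caps registers for the whole compilation unit.
// Exceeding either limit forces spills to local memory.
__global__ void __launch_bounds__(256, 4) boundedKernel(float *gData) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    gData[idx] = 0.0f;                       // placeholder body
}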

In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces latency-hiding ability in the kernel. GPUs thrive on parallelism and do so by covering memory latency with work from other threads.

Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions).
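
For example, here is a sketch of a coalesced access pattern (the kernel and parameter names are illustrative only): consecutive threads read consecutive elements, so each warp touches one contiguous, aligned segment of memory per load.

__global__ void coalesced(const float *gData, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive addresses
    if (idx < n) {
        out[idx] = gData[idx] * 2.0f;                  // one coalesced transaction per warp segment
    }
}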

The example you give may be a case for shared memory if:

1. many threads in the block use the same data, or
2. the per-thread array size is small enough to allocate space for all threads of multiple thread blocks (1024 floats per thread is far too much). A sketch of the shared-memory approach follows this list.
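
A minimal sketch of the shared-memory variant, assuming a much smaller tile that the whole block reuses and a launch with blockDim.x == TILE; the tile size, kernel name, and the final computation are illustrative only:

#define TILE 256                                     // illustrative tile size

__global__ void sharedTile(const float *gData, float *out) {
    __shared__ float sData[TILE];                    // on-chip storage shared by the block
    int tid = threadIdx.x;
    sData[tid] = gData[blockIdx.x * TILE + tid];     // one coalesced load per block
    __syncthreads();                                 // tile is now visible to every thread
    // All threads can now reuse sData[] without further global reads, e.g.:
    out[blockIdx.x * TILE + tid] = sData[tid] + sData[(tid + 1) % TILE];
}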

As njuffa mentioned, the reason your kernel only uses 2 registers is because you don't do anything useful with the data in the kernel, and the dead code was all eliminated by the compiler.
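
To get meaningful numbers out of --ptxas-options=-v, the kernel has to consume the data and write a result back so the loads cannot be eliminated as dead code. A sketch along those lines (the reduction is just a placeholder; the compiler may still fuse the two loops, but the loads are no longer dead):

__global__ void simpleLive(const float *gData, float *out) {
    float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    float acc = 0.0f;
    for (int i = 0; i < 1024; i++) {
        acc += rData[i];                             // actually use the data
    }
    // Writing the result back keeps the whole chain live, so the reported
    // register/local-memory usage reflects the real cost of rData[1024].
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}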
