Why can a kernel fail to launch due to too much register use when there is a register spilling mechanism?


Problem Description


1) When does a kernel start to spill registers to local memory?

2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?

3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?

Solution

1) When does a kernel start to spill registers to local memory?

This is entirely under the control of the compiler. It is not performed by the runtime, and there are no dynamic runtime decisions about it. When your code reaches the point of a spill, it means that the compiler has inserted an instruction like:

STL  [R0], R1

In this case, R1 is being stored to local memory, the local memory address given in R0. This would be a spill store. (After that instruction, R1 could be used for/loaded with something else.) The compiler knows when it has done this, of course, and so it can report the number of spill loads and spill stores it has chosen to use/make. You can get this information (along with register usage, and other information) using the -Xptxas=-v compiler switch.
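As a concrete illustration (assuming nvcc is on the PATH and a file named kernel.cu exists — both are hypothetical here), the switch mentioned above is passed at compile time, and ptxas then reports per-kernel register and spill statistics:

```shell
# Ask ptxas to report resource usage while compiling (hypothetical file name)
nvcc -Xptxas=-v -c kernel.cu
```

For a kernel that spills, the report typically includes lines mentioning "bytes spill stores" and "bytes spill loads" alongside the register count.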

The compiler (unless you restrict it, see below) makes decisions about register usage primarily focused on performance, paying otherwise less attention to how many registers are actually used. The first priority is performance.

2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?

At compile-time, when your kernel code is being compiled, the compiler has no idea how it will be launched. It has no idea what your launch configuration will be like (number of blocks, number of threads per block, amount of dynamically allocated shared memory, etc.). In fact, the compilation process mostly proceeds as if the thing being compiled is a single thread.

During compilation, the compiler makes a bunch of static decisions about register assignments (how and where registers will be used). CUDA has binary utilities that can help with understanding this. Register assignments don't change at runtime, are not in any way dynamic, and therefore are entirely determined at compile time. Therefore, at the completion of compilation for a given device code function, it is generally possible to determine how many registers are needed. The compiler includes this information in the binary compiled object.

At runtime, at the point of kernel launch, the CUDA runtime now knows:

  • How many registers (per thread) are needed for a given kernel
  • What device we are running on, and therefore what the aggregate limits are
  • What the launch configuration is (blocks, threads)

Assembling these 3 pieces of information means the runtime can immediately know if there is or will be enough "register space" for the launch. Roughly speaking, the pass/fail arithmetic is if the launch would satisfy this inequality:

 registers_per_thread*threads_per_block <= max_registers_per_multiprocessor

There is granularity to be considered in this equation as well. Registers are often allocated in groups of 2 or 4 at runtime, i.e. the registers_per_thread quantity may need to be rounded up to the next whole-number multiple of something like 2 or 4, before the inequality test is applied. The registers_per_thread quantity is ascertained by the compiler as already described. The threads_per_block quantity comes from your kernel launch configuration. The max_registers_per_multiprocessor quantity is machine-readable (i.e. it is a function of the GPU you are running on). You can see how to retrieve that quantity yourself if you wish by studying the deviceQuery CUDA sample code.

3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?

I reiterate that register assignment (and register spill decisions) is entirely a static, compile-time process. No runtime decisions or alterations are made. The register assignment is entirely inspectable from the compiled code. Therefore, since no adjustments can be made at runtime, no changes could be made to allow an arbitrary launch. Any such change would require recompilation of the code. While this might be theoretically possible, it is not currently implemented in CUDA. Furthermore, it could lead to variable and perhaps unpredictable behavior (in performance), so there might be reasons not to do it.

It's possible to make all kernels "launchable" (with respect to register limitations) by suitably restricting the compiler's choices about register assignment. __launch_bounds__ and the compiler switch -maxrregcount are a couple of ways to achieve this. CUDA provides both an occupancy calculator as well as an occupancy API to help with this process.
