Does calling __device__ functions impact the number of registers used in CUDA?


Question


I have read in various places that __device__ functions are almost always inlined by the CUDA compiler. Is it correct to say, then, that there is (generally) no increase in the number of registers used when I move code from a kernel into a __device__ function that is called by the kernel?

As an example, do the following snippets use the same number of registers? Are they equally efficient?

SNIPPET 1

__global__ void manuallyInlined(float *A, float *B, float *C, float *D, float *E) {
    // code that manipulates A, B, C, D and E
}

SNIPPET 2

__device__ void fn(float *A, float *B, float *C, float *D, float *E) {
    // code that manipulates A, B, C, D and E
}


__global__ void manuallyInlined(float *A, float *B, float *C, float *D, float *E) {
    fn(A, B, C, D, E);
}

Solution

The final answer can only be determined by using the tools (compile with -Xptxas -v, or use one of the profilers), but the general answer is that calling a __device__ function can impact the number of registers used (as well as performance and efficiency).
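A minimal sketch of the tool-based check (the file name `kernel.cu` and the `sm_70` target here are assumptions, not from the original answer):

```shell
# Ask ptxas to report per-kernel resource usage: registers,
# shared/constant memory, and any register spills to local memory.
nvcc -arch=sm_70 -Xptxas -v -c kernel.cu
# ptxas then prints lines along the lines of:
#   ptxas info : Used NN registers, ... bytes cmem[0]
```

Comparing this output for the two snippets (with and without the separate __device__ function) answers the question directly for your own code.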

Depending on your file organization, and how you compile your code, a __device__ function may be inlined. If it is inlined, this generally gives the optimizing compiler (ptxas, mainly) the best chance to adapt register usage as it sees fit. (Note that, at least in theory, this "adaptation" could result in either more or fewer registers used. However, the inlining case generally results in the compiler using fewer registers and possibly achieving higher performance. But the compiler primarily optimizes for higher performance, not lower register usage.)

On the other hand, if it is not inlined, then it must be handled as an ordinary function call. Like many other computer architectures, a function call involves setting up a stack frame to pass variables, and then transferring control to the function. In this scenario, the compiler is more restricted because:

  1. It must move variables used by the function to/from the stack frame
  2. It cannot perform other optimizations based on "surrounding" code, because it does not know what the surrounding code is. The __device__ function must be handled in a standalone fashion by the compiler.

So if the function can be inlined, there should not be much difference between your two approaches. If the function cannot be inlined, then there will usually be a noticeable difference in register usage between the above two approaches.
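As an illustrative sketch (the function bodies and names here are hypothetical), CUDA's inlining qualifiers let you force each case and compare the two behaviors:

```cuda
// Hint the compiler NOT to inline: this is typically compiled as a
// true function call (with a stack frame), constraining register
// allocation as described above.
__device__ __noinline__ void scaleNoInline(float *A, float *B) {
    A[threadIdx.x] = 2.0f * B[threadIdx.x];
}

// Hint the compiler to inline, even against its usual heuristics.
__device__ __forceinline__ void scaleInline(float *A, float *B) {
    A[threadIdx.x] = 2.0f * B[threadIdx.x];
}

__global__ void kernel(float *A, float *B) {
    scaleNoInline(A, B);  // likely a real call
    scaleInline(A, B);    // folded into the kernel body
}
```

Compiling this with -Xptxas -v shows how each choice affects the kernel's register count.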

Some obvious factors that may impact whether the compiler will attempt to inline a __device__ function are:

  1. If the __device__ function is in a separate compilation unit from the __global__ or other __device__ function that calls it. In this case, the only way this can work is via CUDA separate compilation and linking, also called device-linking. In such a scenario, the compiler will not (cannot) inline the function.

  2. If the __noinline__ compiler directive is specified. Note that this is only a hint to the compiler; it may be ignored.
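To illustrate point 1, a separate-compilation build looks roughly like this (the file names are assumptions): the -rdc=true flag requests relocatable device code, and because fn() is only resolved at the device-link step, it is too late for the compiler to inline it into the kernel.

```shell
# fn() defined in fn.cu, the __global__ kernel in main.cu
nvcc -arch=sm_70 -rdc=true -c fn.cu -o fn.o
nvcc -arch=sm_70 -rdc=true -c main.cu -o main.o
# Device-link and final link; cross-file inlining cannot happen here.
nvcc -arch=sm_70 -rdc=true fn.o main.o -o app
```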
