Does the CUDA C++ math function exp have overloads by default?


Question


The problem comes from the documentation, where I find two functions, exp and expf. It says that exp means double exp(double) and expf means float expf(float). I wonder whether exp has default overloads such as float exp(float) or fp16 exp(fp16), or whether I must use different functions when the input types differ.

Consider the case of using a template:

template <typename T>
T compute(T in) { return exp(in); }


If there is no default float exp(float), I cannot use compute<float>(1.f) to call this template function. I know that I can call the function that way, but I do not know how the compiler handles it. When I call exp(1.f), does the compiler first cast the input to double and then cast the return value back to float, or does it use the float argument directly?

Answer


It is said that exp means double exp(double) and expf means float expf(float). I wonder whether exp has default overloads such as float exp(float) ...


Yes, the CUDA compiler does what a normal C++ compiler does and will transparently select the correctly typed overload of the function. This works for float and double ...

... or fp16 exp(fp16).


... but it does not presently work for half precision floating point.

For example:

$ cat overlay.cu
#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];
};

template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

will compile correctly:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]

but adding

template __global__ void kernel<__half>(const __half*, const __half*, __half*, int);

will fail:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
overlay.cu(9): error: more than one instance of overloaded function "exp" matches the argument list:
            function "std::exp(long double)"
            function "std::exp(float)"
            argument types are: (const __half)
          detected during instantiation of "void kernel(const T *, const T *, T *, int) [with T=__half]"


As pointed out in comments, C++14/C++17 don't define a standardized half precision type or standard library support for one, so this error is in line with the expected behaviour.


If you want a half precision version, then I suggest using an explicit template specialization for the fp16 case which exploits the (most performant) intrinsic for the type, for example:

#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];
};

template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

template<> __global__ void kernel(const __half* x, const __half* y, __half* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = hexp(x[tid]) * y[tid];
};


This is probably the optimal implementation at this stage, and it compiles as expected:

$ nvcc -std=c++11 -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelI6__halfEvPKT_S3_PS1_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelI6__halfEvPKT_S3_PS1_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]


[Answer assembled from comments, with my own editorialisation added, to get the question off the unanswered list for the CUDA tag. Please edit/improve as you see fit]

