CUDA device function pointers in structure without static pointers or symbol copies


Question

My intended program flow would look like the following if it were possible:

typedef struct structure_t
{
  [...]
  /* device function pointer. */
  __device__ float (*function_pointer)(float, float, float[]);
  [...]
} structure;

[...]

/* function to be assigned. */
__device__ float
my_function (float a, float b, float c[])
{
  /* do some stuff on the device. */
  [...]
}

void
some_structure_initialization_function (structure *st)
{
  /* assign. */
  st->function_pointer = my_function;
  [...]
}

This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure.

 error: attribute "device" does not apply here

There are some examples of similar types of problems here on Stack Overflow, but they all involve the use of static pointers outside the structure. Examples are device function pointers as struct members and device function pointers. I've taken a similar approach with success previously in other codes where it's easy for me to use static device pointers and define them outside of any structures. Currently, though, that is a problem: this is written as an API of sorts, and the user may define one, two, or dozens of structures which need to include a device function pointer. So defining static device pointers outside of the structure is a major problem.

I'm fairly certain the answer exists within the posts I have linked above, through the use of symbol copies, but I've not been able to put them to successful use.
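For reference, the static-pointer pattern from those linked posts looks roughly like the following sketch (hypothetical names; the struct member becomes an ordinary pointer with no `__device__` qualifier, and `my_function` stands in for the question's device function). Note that it requires one static `__device__` pointer variable per function, which is exactly what does not scale for an API where users define arbitrarily many structures:

```cuda
typedef float (*fp)(float, float, float[]);

typedef struct structure_t
{
    /* plain pointer member -- no __device__ qualifier. */
    fp function_pointer;
} structure;

__device__ float
my_function (float a, float b, float c[])
{
    /* do some stuff on the device. */
    return a + b;
}

/* one static device-side pointer per function, outside any structure. */
__device__ fp my_function_ptr = my_function;

void
some_structure_initialization_function (structure *st)
{
    /* symbol copy: fetch the device-side pointer value to the host. */
    fp h_ptr;
    cudaMemcpyFromSymbol(&h_ptr, my_function_ptr, sizeof(fp));
    st->function_pointer = h_ptr;
}
```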

Answer

What you are trying to do is possible, but you have made a few mistakes in the way you are declaring and defining the structures that will hold and use the function pointer.


This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure.

 error: attribute "device" does not apply here


This is only because you are attempting to assign a memory space to a structure or class data member, which is illegal in CUDA. The memory space of all class or structure data members is implicitly set when you define or instantiate the class. So something only slightly different (and more concrete):

typedef float (* fp)(float, float, float4);

struct functor
{
    float c0, c1;
    fp f;

    __device__ __host__
    functor(float _c0, float _c1, fp _f) : c0(_c0), c1(_c1), f(_f) {};

    __device__ __host__
    float operator()(float4 x) { return f(c0, c1, x); };
};

__global__
void kernel(float c0, float c1, fp f, const float4 * x, float * y, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    struct functor op(c0, c1, f);
    for(int i = tid; i < N; i += blockDim.x * gridDim.x) {
        y[i] = op(x[i]);
    }
}

is perfectly valid. The function pointer fp in functor is implicitly a __device__ function when an instance of functor is instantiated in device code. If it were instantiated in host code, the function pointer would implicitly be a host function. In the kernel, a device function pointer passed as an argument is used to instantiate a functor instance. All perfectly legal.

I believe I am correct in saying that there is no direct way to get the address of a __device__ function in host code, so you still require some static declarations and symbol manipulation. This might be different in CUDA 5, but I have not tested it to see. If we flesh out the device code above with a couple of __device__ functions and some supporting host code:

__device__ __host__ 
float f1 (float a, float b, float4 c)
{
    return a + (b * c.x) +  (b * c.y) + (b * c.z) + (b * c.w);
}

__device__ __host__
float f2 (float a, float b, float4 c)
{
    return a + b + c.x + c.y + c.z + c.w;
}

__constant__ fp function_table[] = {f1, f2};

int main(void)
{
    const float c1 = 1.0f, c2 = 2.0f;
    const int n = 20;
    float4 vin[n];
    float vout1[n], vout2[n];
    for(int i=0, j=0; i<n; i++) {
        vin[i].x = j++; vin[i].y = j++;
        vin[i].z = j++; vin[i].w = j++;
    }

    float4 * _vin;
    float * _vout1, * _vout2;
    size_t sz4 = sizeof(float4) * size_t(n);
    size_t sz1 = sizeof(float) * size_t(n);
    cudaMalloc((void **)&_vin, sz4);
    cudaMalloc((void **)&_vout1, sz1);
    cudaMalloc((void **)&_vout2, sz1);
    cudaMemcpy(_vin, &vin[0], sz4, cudaMemcpyHostToDevice);

    fp funcs[2];
    /* symbol copy: read the device function pointers out of the
       __constant__ table (CUDA 5 and later expect the symbol itself
       here, not a string naming it). */
    cudaMemcpyFromSymbol(funcs, function_table, 2 * sizeof(fp));

    kernel<<<1,32>>>(c1, c2, funcs[0], _vin, _vout1, n);
    cudaMemcpy(&vout1[0], _vout1, sz1, cudaMemcpyDeviceToHost); 

    kernel<<<1,32>>>(c1, c2, funcs[1], _vin, _vout2, n);
    cudaMemcpy(&vout2[0], _vout2, sz1, cudaMemcpyDeviceToHost); 

    struct functor func1(c1, c2, f1), func2(c1, c2, f2); 
    for(int i=0; i<n; i++) {
        printf("%2d %6.f %6.f (%6.f,%6.f,%6.f,%6.f ) %6.f %6.f %6.f %6.f\n", 
                i, c1, c2, vin[i].x, vin[i].y, vin[i].z, vin[i].w,
                vout1[i], func1(vin[i]), vout2[i], func2(vin[i]));
    }

    return 0;
}

you get a fully compilable and runnable example. Here two __device__ functions and a static function table provide a mechanism for the host code to retrieve __device__ function pointers at runtime. The kernel is called once with each __device__ function and the results displayed, along with the exact same functor and functions instantiated and called from host code (and thus running on the host) for comparison:

$ nvcc -arch=sm_30 -Xptxas="-v" -o function_pointer function_pointer.cu 

ptxas info    : Compiling entry function '_Z6kernelffPFfff6float4EPKS_Pfi' for 'sm_30'
ptxas info    : Function properties for _Z6kernelffPFfff6float4EPKS_Pfi
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z2f1ff6float4
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z2f2ff6float4
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 356 bytes cmem[0], 16 bytes cmem[3]

$ ./function_pointer 
 0      1      2 (     0,     1,     2,     3 )     13     13      9      9
 1      1      2 (     4,     5,     6,     7 )     45     45     25     25
 2      1      2 (     8,     9,    10,    11 )     77     77     41     41
 3      1      2 (    12,    13,    14,    15 )    109    109     57     57
 4      1      2 (    16,    17,    18,    19 )    141    141     73     73
 5      1      2 (    20,    21,    22,    23 )    173    173     89     89
 6      1      2 (    24,    25,    26,    27 )    205    205    105    105
 7      1      2 (    28,    29,    30,    31 )    237    237    121    121
 8      1      2 (    32,    33,    34,    35 )    269    269    137    137
 9      1      2 (    36,    37,    38,    39 )    301    301    153    153
10      1      2 (    40,    41,    42,    43 )    333    333    169    169
11      1      2 (    44,    45,    46,    47 )    365    365    185    185
12      1      2 (    48,    49,    50,    51 )    397    397    201    201
13      1      2 (    52,    53,    54,    55 )    429    429    217    217
14      1      2 (    56,    57,    58,    59 )    461    461    233    233
15      1      2 (    60,    61,    62,    63 )    493    493    249    249
16      1      2 (    64,    65,    66,    67 )    525    525    265    265
17      1      2 (    68,    69,    70,    71 )    557    557    281    281
18      1      2 (    72,    73,    74,    75 )    589    589    297    297
19      1      2 (    76,    77,    78,    79 )    621    621    313    313

If I have understood your question correctly, the above example should give you pretty much all the design patterns you need to implement your ideas in device code.
