Accessing class data members from within cuda kernel - how to design proper host/device interaction?


Question

I've been trying to transform some CUDA/C code into more OO code, but my goal doesn't seem easy to achieve with my current understanding of the CUDA functioning mechanism. I haven't been able to find a good explanation of this situation either. It might not be possible after all.

I have a global object of class myClass holding an array to be filled in a kernel.

How should the methods in myClass be defined so that the array and boolean members are visible from the device, and the array can then be copied back to the host? I am using CUDA 7.5 and the compute capability of my card is 3.5.

This is a tentative structure describing the situation:

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __device__ __host__ myClass();
        __device__ __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
};

__host__ __device__ myClass::myClass()
{
}

__host__ __device__ myClass::~myClass()
{
#ifdef __CUDACC__
        if(bool_var)
                cudaFree(data);
#else
        free(data);
#endif
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&data[idx], toadd); // data should be unique among threads
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();           // unknown
        myKernel<<<1,some_number>>>(globalInstance); // how to pass the object?
        globalInstance.retrieveDataToHost();         // unknown
        globalInstance.export();
        exit(EXIT_SUCCESS);
}

Answer

Your approach should be workable. When you pass an object by value as a kernel parameter (as you have indicated) there really isn't much setup that needs to be done associated with the transfer from host to device.
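The key point is that the object itself is copied into kernel parameter space at launch, so scalar members travel automatically, and pointer members keep pointing at whatever they pointed at before the launch (which must therefore be device memory). A minimal, hypothetical illustration of that mechanism (names here are not from the question):

```cuda
// Hypothetical sketch: a trivially-copyable object passed by value to a kernel.
struct Params {
    int  n;
    int *d_buf;   // must already point to device memory before the launch
};

__global__ void fill(Params p)   // p is a by-value copy in kernel parameter space
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        p.d_buf[i] = i;          // the embedded pointer is still valid on the device
}
```

This is also why the example below keeps `data` as a raw pointer that is filled in by `cudaMalloc` on the host before the kernel launch.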

You need to properly allocate data on the host and the device, and use cudaMemcpy type operations at appropriate points to move the data, just as you would in an ordinary CUDA program.

One thing to be aware of when declaring an object at global scope, as you have done, is that it is recommended not to use CUDA API calls in the object's constructor or destructor. The reasons are covered here; I won't repeat them. Although that treatment mostly focuses on kernels launched before main, CUDA lazy initialization can also impact any CUDA API call executed outside of main's scope, which applies to the constructors and destructors of objects instantiated at global scope.

What follows is a fleshed-out example of what you have shown. I mostly didn't change the code you had already written, just added method definitions for the ones you hadn't. There are obviously many different possible approaches here. For more examples you might want to look at the CUDA C++ integration sample code.

Here's a worked example around what you have shown:

$ cat t1236.cu
#include <cstdio>

class myClass
{
public:
        bool bool_var;    // Set from host and readable from device
        int  data_size;   // Set from host
        __host__ myClass();
        __host__ ~myClass();
        __host__ void setValues(bool iftrue, int size);
        __device__ void dosomething(int device_parameter);
        __host__ void export_data();

        // completely unknown methods
        __host__ void prepareDeviceObj();
        __host__ void retrieveDataToHost();
private:
        int *data; // Filled in device, shared between threads, at the end copied back to host for data output
        int *h_data;
};

__host__ myClass::myClass()
{
}

__host__ myClass::~myClass()
{
}

__host__ void myClass::prepareDeviceObj(){
        cudaMemcpy(data, h_data, data_size*sizeof(h_data[0]), cudaMemcpyHostToDevice);
}
__host__ void myClass::retrieveDataToHost(){
        cudaMemcpy(h_data, data, data_size*sizeof(h_data[0]), cudaMemcpyDeviceToHost);
}

__host__ void myClass::setValues(bool iftrue, int size)
{
        bool_var  = iftrue;
        data_size = size;
        cudaMalloc(&data, data_size*sizeof(data[0]));
        h_data = (int *)malloc(data_size*sizeof(h_data[0]));
        memset(h_data, 0, data_size*sizeof(h_data[0]));
}

__device__ void myClass::dosomething(int idx)
{
        int toadd = idx+data_size;
        atomicAdd(&(data[idx]), toadd); // data should be unique among threads
}
__host__ void myClass::export_data(){
        for (int i = 0; i < data_size; i++) printf("%d ", h_data[i]);
        printf("\n");
        cudaFree(data);
        free(h_data);
}


__global__ void myKernel(myClass obj)
{
        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if(idx < obj.data_size)
        {
                if(!obj.bool_var)
                        printf("Object is not up to any task here!");
                else
                {
                        //printf("Object is ready!");
                        obj.dosomething(idx);
                }
        }
}


myClass globalInstance;

int main(int argc, char** argv)
{
        int some_number = 40;
        globalInstance.setValues(true, some_number);
        globalInstance.prepareDeviceObj();
        myKernel<<<1,some_number>>>(globalInstance);
        globalInstance.retrieveDataToHost();
        globalInstance.export_data();
        exit(EXIT_SUCCESS);
}
$ nvcc -o t1236 t1236.cu
$ cuda-memcheck ./t1236
========= CUDA-MEMCHECK
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
========= ERROR SUMMARY: 0 errors
$
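As a sanity check of that output: each of the 40 elements starts at 0 and receives exactly one atomicAdd of `idx + data_size`, so element `idx` should end up holding `idx + 40`. That arithmetic can be reproduced outside CUDA (plain Python, independent of the GPU code):

```python
# Each of the 40 threads performs one atomicAdd of (idx + data_size)
# into a zero-initialized slot, so element idx ends up as idx + 40.
data_size = 40
expected = [i + data_size for i in range(data_size)]
print(" ".join(str(x) for x in expected))  # first value 40, last value 79
```

The printed sequence matches the cuda-memcheck run above.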
