Accessing Class Member in different CUDA kernels


Problem Description

I have a GPU-only class T which I want to create on the GPU, but keep a reference to it on the CPU, so I can pass that pointer as an argument to different CUDA kernels.

class T
{
public:
    int v;
public:
    __device__ T() { v = 10; }
    __device__ ~T() {}
    __device__ int compute() { return v; }
};

Here are the kernels that I use to create the class instance and to call the compute() function.

__global__ void kernel(T* obj, int* out)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        out[0] = obj->compute(); // no kernel error, but it returns garbage
    }
}

__global__ void cudaAllocateGPUObj(T* obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        obj = new T;
        // if I call `out[0] = obj->compute();` here, everything works fine
    }
}

The main function simply allocates memory for a pointer of type T*, which is later used as an argument to cudaAllocateGPUObj.

int main()
{
    int cpu, *gpu;
    cudaMalloc((void**)&gpu, sizeof(int));
    T* obj;
    cudaMalloc((void**)&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);
    kernel<<<1,1>>>(obj, gpu);
    cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    printf("cudaMemcpy\nresult: %d\n", cpu);
    return 0;
}

The problem with this code (as noted in the comments in the code) is that when I call out[0] = obj->compute(); inside the cudaAllocateGPUObj kernel and transfer the obtained value to the CPU, everything is correct. But if I try to read the member value in another kernel, it comes back as garbage, although if I change the return value from the variable v to a constant, everything works fine.

Could you please tell me what is wrong with this code?

Answer

When you pass a parameter to a CUDA kernel, it is a pass-by-value mechanism. You have started with a pointer to an object:

T* obj;

then, instead of allocating storage for the object, you allocate storage for another pointer:

cudaMalloc((void**)&obj, sizeof(T*));

so we're headed down the wrong path here. (This is a logical C programming error at this point.) Next, in the allocate kernel, the obj parameter (which now points to some location in GPU memory space) is passed by value:

__global__ void cudaAllocateGPUObj(T* obj)
                                      ^^^ pass-by-value: local copy is made

Now, when you do this:

        obj = new T;

You create a new pointer, and overwrite the local copy of obj with that new pointer. So of course that works locally, but the copy of obj in the calling environment is not updated with that new pointer.

One possible way to fix this is to use a proper pointer-to-pointer approach:

$ cat t5.cu
#include <stdio.h>

class T
{
public:
    int v;
public:
    __device__ T() { v = 10; }
    __device__ ~T() {}
    __device__ int compute() { return v; }
};

__global__ void kernel(T** obj, int* out)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        out[0] = (*obj)->compute(); 
    }
}

__global__ void cudaAllocateGPUObj(T** obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        *obj = new T;
    }
}

int main()
{
    int cpu, *gpu;
    cudaMalloc((void**)&gpu, sizeof(int));
    T** obj;
    cudaMalloc(&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);
    kernel<<<1,1>>>(obj, gpu);
    cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    printf("cudaMemcpy\nresult: %d\n", cpu);
    return 0;
}

$ nvcc -arch=sm_35 -o t5 t5.cu
$ cuda-memcheck ./t5
========= CUDA-MEMCHECK
cudaMemcpy
result: 10
========= ERROR SUMMARY: 0 errors
$
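As a possible variant (a sketch under the same class T and cudaAllocateGPUObj(T**) as above, not part of the original answer): once the allocating kernel has stored the device pointer at *obj, the pointer value itself can be copied back to the host with cudaMemcpy. The host then knows the actual address of the device-side object and can pass it by value to kernels that take a plain T* parameter:

```cuda
int main()
{
    T** obj;                                   // device slot that will hold the T*
    cudaMalloc(&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);          // device writes *obj = new T;

    T* hostCopy;                               // host copy of the device pointer value
    cudaMemcpy(&hostCopy, obj, sizeof(T*), cudaMemcpyDeviceToHost);

    // hostCopy now holds the address of the device-side object and can be
    // passed by value to any kernel taking T*, e.g. the original kernel().
    return 0;
}
```

Note that the object itself still lives on the device heap (it was created with in-kernel new), so the host cannot cudaMemcpy the object's contents directly; only the pointer value is usable on the host, as an opaque handle to pass back to kernels.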
