CUDA cudaMemcpy Struct of Arrays


Problem description

I'd like to clean up the parameters of the CUDA kernels in my project.

Now, a kernel needs 3 uint32_t arrays, which leads to pretty ugly code (id means the global thread ID and valX is some arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I'd like to wrap all those arrays in a struct:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} S;

where size denotes the length of every arrX inside the struct.

What I would like to have is something like:

__global__ void some_kernel(S *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}


What would the corresponding cudaMalloc and cudaMemcpy look like for a struct like this? Are there any performance drawbacks to this that I'm not seeing yet?

Thanks in advance!

Answer

You have at least two options. One excellent choice was already given by talonmies, but I'll introduce you to the "learn the hard way" approach.
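For contrast, a common alternative is to pass the struct to the kernel by value, which removes the need to allocate or copy the struct on the device at all. A minimal sketch, assuming the same S definition and dev_arrX device pointers introduced below; some_kernel_byval is a hypothetical name:

```cuda
// Sketch: pass S by value. The pointer members still refer to device
// memory, but the struct itself travels in kernel parameter space, so
// no cudaMalloc/cudaMemcpy is needed for the struct itself.
__global__ void some_kernel_byval(S s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < s.size)
    {
        s.arr1[id] = 1; // val1
        s.arr2[id] = 2; // val2
        s.arr3[id] = 3; // val3
    }
}

// Host side: fill a struct with device pointers and pass it directly, e.g.
// S host_s = { dev_arr1, dev_arr2, dev_arr3, size };
// some_kernel_byval<<<size / 256 + 1, 256>>>(host_s);
```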

First, your structure definition:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3; 
    uint32_t size;
} S;

...and the kernel definition (with a global size variable, but you don't need to follow that pattern):

const int size = 10000;

__global__ void some_kernel(S *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}

Notice that the if guard protects you from running out of bounds.

Next comes a function that prepares the data, executes the kernel, and prints some results. Part one is data allocation:

uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// Allocate and fill host data
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// Allocate device data   
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// Allocate helper struct on the device
S *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));

Nothing special here: you just allocate three arrays and the struct. What's more interesting is how to handle copying such data to the device:

// Copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// NOTE: Binding pointers with dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);

Besides the ordinary array copies you noticed, it's also necessary to "bind" them to the struct. For that you need to pass the address of each pointer member; as a result, only those pointer values are copied.
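An equivalent way to do this binding, sketched here with the same names, is to stage a host-side copy of the struct whose members already hold the device pointers, and copy the whole struct in a single call:

```cuda
// Host-side staging struct: its pointer members hold *device*
// addresses, so it must only ever be dereferenced on the GPU.
S host_s;
host_s.arr1 = dev_arr1;
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;

// One copy replaces the three pointer-binding cudaMemcpy calls above.
cudaMemcpy(dev_s, &host_s, sizeof(S), cudaMemcpyHostToDevice);
```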

Next: the kernel call, copying the data back to the host, and printing the results:

// Call kernel: launch enough 256-thread blocks to cover all `size`
// elements; the bounds check in the kernel handles the extra threads
some_kernel<<<size / 256 + 1, 256>>>(dev_s);

// Copy result to host:
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// Print some result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;
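The snippets above never release what they allocate; a sketch of the matching cleanup, using the same names:

```cuda
// Release device memory: each array, then the struct holding them
cudaFree(dev_arr1);
cudaFree(dev_arr2);
cudaFree(dev_arr3);
cudaFree(dev_s);

// Release host arrays allocated with new[]
delete[] host_arr1;
delete[] host_arr2;
delete[] host_arr3;
```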

Keep in mind that in any serious code you should always check for errors from CUDA API calls.
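One common pattern for this is a small checking macro wrapped around every runtime call; CUDA_CHECK below is a hypothetical name, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage, e.g.:
// CUDA_CHECK(cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1)));
// Kernel launches return nothing; check them afterwards with:
// CUDA_CHECK(cudaGetLastError());
```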
