CUDA cudaMemcpy Struct of Arrays


Question

I'd like to clean up the parameters of the CUDA kernels in my project.

Right now, a kernel needs three uint32_t arrays, which leads to pretty ugly code (id is the global thread id and valX is some arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I'd like to surround all those arrays with a struct:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} S;

where size denotes the length of each arrX inside the struct.

What I'd like to have is:

__global__ void some_kernel(S *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}


What would the corresponding cudaMalloc and cudaMemcpy look like for a struct like this? Are there any performance drawbacks to this that I'm not seeing yet?

Thanks in advance!

Answer

You have at least two options. One excellent choice was already given by talonmies, but I'll introduce you to the "learn it the hard way" approach.
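(As an aside, the other option presumably boils down to passing the struct to the kernel by value, so the struct itself never needs its own cudaMalloc/cudaMemcpy; here is a minimal sketch, not part of the original answer, reusing the S definition shown below. The kernel name is made up for illustration.)

// Sketch only: S is passed by value, so only the arrays live in device memory.
__global__ void some_kernel_by_value(S s)   // hypothetical name
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < s.size)
    {
        s.arr1[id] = 1;
        s.arr2[id] = 2;
        s.arr3[id] = 3;
    }
}

// Host side: fill a local S with the device pointers and launch directly, e.g.
//   S s = { dev_arr1, dev_arr2, dev_arr3, size };
//   some_kernel_by_value<<<(size + 255) / 256, 256>>>(s);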

First, your structure definition:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3; 
    uint32_t size;
} S;

...and the kernel definition (using a global variable for the size, though you don't need to follow that pattern):

const int size = 10000;

__global__ void some_kernel(S *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}

Notice that the if protects you from running out of bounds.

Next comes a function that prepares the data, runs the kernel, and prints some results. Part one is data allocation:

uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// Allocate and fill host data
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// Allocate device data   
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// Allocate helper struct on the device
S *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));

Nothing special here: you just allocate the three arrays and the struct. What looks more interesting is how to handle copying such data to the device:

// Copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// NOTE: Binding pointers with dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);

Besides the ordinary array copies you've seen, it is also necessary to "bind" the arrays to the struct. To do that, you pass the address of each pointer, so only the pointers themselves are copied.
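Equivalently, you can build a host-side copy of S that already holds the device pointers and copy the whole struct in one cudaMemcpy. This is a sketch not in the original answer; host_s is a name introduced here:

// Alternative sketch: bind all three pointers (and size) in a single copy.
S host_s;               // hypothetical host-side staging copy of the struct
host_s.arr1 = dev_arr1;
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;     // the kernel above uses the global size, but filling this doesn't hurt
cudaMemcpy(dev_s, &host_s, sizeof(S), cudaMemcpyHostToDevice);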

Next, call the kernel, copy the data back to the host, and print the results:

// Call kernel
some_kernel<<<size / 256 + 1, 256>>>(dev_s); // enough 256-thread blocks to cover all size elements

// Copy result to host:
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// Print some result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;
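The snippet above never frees anything; in a full program you would also release the allocations when done. A minimal sketch, not part of the original answer:

// Cleanup sketch: release device and host memory.
cudaFree(dev_arr1);
cudaFree(dev_arr2);
cudaFree(dev_arr3);
cudaFree(dev_s);
delete[] host_arr1;
delete[] host_arr2;
delete[] host_arr3;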

Keep in mind that in any serious code you should always check for errors from CUDA API calls.
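A common pattern for that (a sketch, not from the original answer) is to wrap every call in a small checking macro:

// Hypothetical helper: abort with file/line information if a CUDA call fails.
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage example:
//   CUDA_CHECK(cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1)));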

