Copying 2D arrays to GPU of known variable width


Problem description

I am looking into how to copy a 2D array with a variable width for each row into the GPU.

int rows = 1000;    // number of rows
int cols;           // varies from row to row
int **host_matrix = (int **)malloc(sizeof(int *) * rows);
int **d_array;      // device copy of the row pointers
int *length;        // length[i] = number of ints in row i

...

Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy padding data. Is there a better way of doing it?

According to this thread, this would not be a clever way of doing it:

cudaMalloc((void **)&d_array, rows * sizeof(int *));
for (int i = 0; i < rows; i++) {
    // Broken: d_array[i] dereferences device memory on the host.
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}

But I cannot think of any other method. Is there any other, smarter way of doing it? Can it be improved using cudaMallocPitch and cudaMemcpy2D?

Answer

The correct way to allocate an array of pointers for the GPU in CUDA is something like this:

int **hd_array, **d_array;
hd_array = (int **)malloc(nrows * sizeof(int *));
cudaMalloc((void **)&d_array, nrows * sizeof(int *));
for (int i = 0; i < nrows; i++) {
    // Allocate each row on the device, keeping the device pointer on the host.
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
// Copy the assembled array of device pointers to the device.
cudaMemcpy(d_array, hd_array, nrows * sizeof(int *), cudaMemcpyHostToDevice);

(disclaimer: written in browser, never compiled, never tested, use at own risk)

The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy the data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
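To make the cost concrete, here is a rough sketch (untested, and reusing host_matrix and length from the question) of the per-row transfers the scheme above would still need on top of the allocations:

// Hypothetical continuation of the snippet above: each row's data still
// needs its own cudaMemcpy, one API call per row.
for (int i = 0; i < nrows; i++) {
    cudaMemcpy(hd_array[i], host_matrix[i], length[i] * sizeof(int),
               cudaMemcpyHostToDevice);
}

For 1000 rows that is 1000 extra copy calls on top of the single pointer-table copy, which is where the 1001 figure above comes from.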

If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and copying one of the sparse matrix formats for your data instead. Using the classic compressed sparse row (CSR) format as a model, you could do something like this:

int *data, *rows, *lengths;

cudaMalloc((void **)&rows, nrows * sizeof(int));     // start offset of each row
cudaMalloc((void **)&lengths, nrows * sizeof(int));  // length of each row
cudaMalloc((void **)&data, N * sizeof(int));         // all N elements, packed

In this scheme, all the data is stored in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]], and each row has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than the nrows operations in your current scheme, i.e. it reduces the overhead from O(N) to O(1).
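As a rough illustration (not part of the original answer, untested), the packing and transfer could look like this, again assuming the host_matrix and length arrays from the question:

#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

// Hypothetical helper: pack a jagged host array into CSR-style device storage.
void copy_jagged_to_device(int **host_matrix, const int *length, int nrows,
                           int **d_data, int **d_rows, int **d_lengths)
{
    // Build the row start offsets and count the total number of elements N.
    int *h_rows = (int *)malloc(nrows * sizeof(int));
    int N = 0;
    for (int i = 0; i < nrows; i++) {
        h_rows[i] = N;
        N += length[i];
    }

    // Pack all rows into one contiguous host buffer.
    int *h_data = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < nrows; i++)
        memcpy(h_data + h_rows[i], host_matrix[i], length[i] * sizeof(int));

    // Three allocations and three copies, independent of nrows.
    cudaMalloc((void **)d_data, N * sizeof(int));
    cudaMalloc((void **)d_rows, nrows * sizeof(int));
    cudaMalloc((void **)d_lengths, nrows * sizeof(int));
    cudaMemcpy(*d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_rows, h_rows, nrows * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_lengths, length, nrows * sizeof(int), cudaMemcpyHostToDevice);

    free(h_rows);
    free(h_data);
}

Inside a kernel, element j of row i is then data[rows[i] + j], and lengths[i] tells each thread where row i ends.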
