2D array on CUDA


Question

I want to dynamically allocate a global 2D array in CUDA. How can I achieve this?

In my main I am calling my kernel in a loop. But before I call the kernel I need to allocate some memory on the GPU. After the kernel call a single integer is sent from GPU to CPU to inform whether the problem is solved or not.
If the problem is not solved, I will not free the old memory, since there is a further need of it, and I should allocate new memory on the GPU and call the kernel again.

A pseudocode sketch:

int n=0,i=0;
while(n==0)
{
    //allocate 2d memory for MEM[i++] 
    //call kernel(MEM,i)
    // get n from kernel       
}


__global__ void kernel(float **Mem, int i)
{
    Mem[0][5] = 1;
    Mem[1][0] = Mem[0][5] + 23; // can use this when MEM[1] is allocated before kernel call
}

Any suggestions? Thank you.

Answer

Two opening comments - using a dynamically allocated 2D array is a bad idea in CUDA, and doing repetitive memory allocations in a loop is also not a good idea. Both incur needless performance penalties.

For the host code, something like this:

const size_t alloc_elems = 16000;                       // floats per kernel call
const size_t allocsize = alloc_elems * sizeof(float);   // bytes per kernel call
const int n_allocations = 16;                           // upper bound on loop iterations

float * dpointer;
cudaMalloc((void **)&dpointer, n_allocations * allocsize);  // one allocation for everything

float * dcurrent = dpointer;
int n = 0;
// advance the working pointer by elements (not bytes) each iteration
for(int i = 0; (n == 0) && (i < n_allocations); i++, dcurrent += alloc_elems) {

    // whatever you do before the kernel

    kernel <<< gridsize, blocksize >>> (dcurrent, .....);

    // whatever you do after the kernel

}

is preferable. Here cudaMalloc is called only once, and offsets into that single allocation are passed around, so there is no memory allocation or management left inside the loop. The bounded loop structure also means you cannot run endlessly and exhaust all the GPU memory.
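
The question mentions copying a single integer back after each kernel call to decide whether to keep looping; a minimal sketch of that step, assuming a hypothetical device flag d_done that the kernel sets to a non-zero value once the problem is solved (this flag and the extra kernel argument are not part of the original answer):

// Hypothetical completion flag; the kernel is assumed to take an extra int*
// argument and write 1 to it when the problem has been solved.
int *d_done;
cudaMalloc((void **)&d_done, sizeof(int));

// inside the loop body, around the launch:
cudaMemset(d_done, 0, sizeof(int));                              // clear the flag
kernel <<< gridsize, blocksize >>> (dcurrent, d_done /*, ... */);
cudaMemcpy(&n, d_done, sizeof(int), cudaMemcpyDeviceToHost);     // also synchronises
// the loop condition (n == 0) then exits once the kernel reports success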

On the 2D array question itself, there are two reasons why it is a bad idea. Firstly, allocating a 2D array with N rows requires (N+1) cudaMalloc calls plus a host-to-device copy of the row pointers, which is slow and ugly. Secondly, inside the kernel code the GPU must do two global memory reads to reach your data: one through the pointer indirection to get the row address, and then one to fetch the data from that row. That is much slower than this alternative:

#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
__global__ void kernel(float * Mem, int lda, ....)
{
    Mem[idx(0,5,lda)] = 1; // equivalent of Mem[0][5] = 1 in the 2D version
}
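
For comparison, a rough sketch of the pointer-of-pointers setup described in the first point above (N and width are illustrative names, error checking omitted); it is this pattern that costs N+1 cudaMalloc calls plus a host-to-device copy of the row-pointer table:

// The discouraged approach: a real float** on the device.
float **d_rows;                                          // device-side table of row pointers
cudaMalloc((void **)&d_rows, N * sizeof(float *));       // 1 call for the pointer table

float **h_rows = (float **)malloc(N * sizeof(float *));
for (int r = 0; r < N; r++)
    cudaMalloc((void **)&h_rows[r], width * sizeof(float)); // N more calls, one per row

cudaMemcpy(d_rows, h_rows, N * sizeof(float *),
           cudaMemcpyHostToDevice);                      // copy the row pointers to the device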

The alternative above instead uses indexing into a 1D allocation. On the GPU, memory transactions are very expensive, but FLOPS and IOPS are cheap, so a single integer multiply-add is the most efficient way to compute the index. If you need to access results from a previous kernel call, just pass the offset of the previous results and use two pointers inside the kernel, something like this:

__global__ void kernel(float *Mem, int lda, int this_offset, int previous_offset)
{
   // ("this" is a C++ keyword, so the offsets are renamed here)
   float * Mem0 = Mem + this_offset;      // results being written by this call
   float * Mem1 = Mem + previous_offset;  // results from the previous call

}
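
On the host side, the two offsets can be computed from the slice size used in the earlier allocation loop; a minimal sketch, reusing dpointer and alloc_elems from that snippet (the launch configuration and lda are assumed to come from your own code):

// Hypothetical launch for iteration i of the loop: this call writes slice i
// and reads the results of the previous call from slice i-1.
int this_offset     = (int)(i * alloc_elems);
int previous_offset = (int)((i - 1) * alloc_elems);

kernel <<< gridsize, blocksize >>> (dpointer, lda, this_offset, previous_offset);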

Efficient distributed-memory programs (and CUDA really is a kind of distributed-memory programming) start to look like Fortran after a while, but that is the price you pay for portability, transparency and efficiency.

Hope this helps.
