Operations on a 2D array in a CUDA kernel for MATLAB

Question

Say I have the following serial C:

int add(int **a, int **b, int n)
{
    int i, j;

    for (i = 0; i < n; i++)
    {
        for (j = 0; j < n; j++)
        {
            a[i][j] += b[i][j];
        }
    }

    return 0;
}

I think the best way to parallelise it is to realise it is a 2D problem and use 2D thread blocks, as per CUDA kernel - nested for loop.

With that in mind, I started writing my CUDA kernel like this:

__global__ void calc(int **A, int **B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i >= n || j >= n)
        return;

    A[i][j] += B[i][j];
}

nvcc told me:

./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space  

1) Am I correct with my philosophy?

2) I think I understand blocks, threads, etc., but I don't understand what

    int i= blockIdx.x * blockDim.x + threadIdx.x;
    int j= blockIdx.y * blockDim.y + threadIdx.y;

does.

3) Is this the most efficient/fastest way of performing operations on a 2D array in general? I.e. not just matrix addition, it could be any "element by element" operation.

4) Will I be able to call it from MATLAB? Normally it freaks out when the prototype is of the form type** var.

Thanks guys

Answer

The compiler warnings you are getting come from the fact that on older GPUs the memory structure is not "flat". The compiler can't know what memory space the addresses held by the pointer arrays your kernel is working with are in, so it warns you that it is assuming the operation is being performed in global memory. If you compile the code for a Fermi card (the sm_20 or sm_21 architecture), you won't see that warning, because on those cards the memory model is "flat" and pointers are correctly interpreted by the hardware at runtime. The compiler doesn't need to handle it at compile time.
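For example, compiling the same file for the Fermi architecture (the file name addm.cu is taken from the warning messages above) makes the warning disappear:

nvcc -arch=sm_20 addm.cu -o addm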

To answer each of your questions:


1. Yes. And no. The overall idea is about 90% right, but there are several implementation issues, which will become apparent from the answers that follow.

2. CUDA C has built-in variables that allow each thread to determine its "coordinates" in the execution grid in which it is running, and the dimensions of each block and of the grid itself. threadIdx.{xyz} provides the thread coordinates within a block, and blockIdx.{xyz} the block coordinates within the grid. blockDim.{xyz} and gridDim.{xyz} provide the dimensions of the block and the grid, respectively (note that not all hardware supports 3D grids). CUDA uses column-major order for numbering threads within each block and blocks within each grid. The calculation you are asking about computes the equivalent {i,j} coordinate in a 2D grid from the thread and block coordinates and the block size. This is discussed in some detail in the first few pages of the "Programming model" chapter of the CUDA programming guide.
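To make that mapping concrete, here is a hypothetical host-side launch for the 2D kernel above; the 16x16 block size is an assumption for illustration, not something specified in the question or answer:

dim3 block(16, 16);                         // 256 threads per block (assumed size)
dim3 grid((n + block.x - 1) / block.x,      // ceiling division so the grid
          (n + block.y - 1) / block.y);     // covers all n x n elements
calc<<<grid, block>>>(A, B, n);             // threads with i >= n or j >= n exit early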

3. No, for two reasons.

Firstly, using arrays of pointers for memory access is not a good idea in CUDA. Two levels of pointer indirection hugely increase the latency penalty of getting to your data. The key difference between a typical GPU architecture and a modern CPU architecture is the memory system: GPUs have stunningly high peak memory bandwidth but very high access latency, whereas CPUs are designed for minimal latency. So having to read and dereference two pointers to fetch a value from memory incurs a very big performance penalty. Store your 2D array or matrix in linear memory instead. This is what BLAS, LAPACK and Matlab do anyway.
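As a concrete illustration (not part of the original answer), replacing the pointer-array layout with a single flat allocation looks like this; element (i, j) of an n x n matrix lives at index i * n + j:

#include <stdlib.h>

int main(void)
{
    int n = 1024;
    /* One flat allocation replaces n separate row allocations */
    int *a = (int *)malloc(n * n * sizeof(int));

    /* a[i][j] in the pointer-array version becomes a[i * n + j]:
       one pointer dereference instead of two */
    a[2 * n + 3] = 42;    /* element (2, 3) */

    free(a);
    return 0;
}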

Secondly, every thread in your code performs four integer arithmetic operations of setup overhead (the index calculations) for every one "productive" integer operation (the addition). There are strategies to reduce that, usually involving having each thread process more than one array element.

If I were to write a kernel for that operation, I would do it something like the code at the bottom of my answer. This uses linear memory and a 1D grid. A suitable number of threads to properly occupy the GPU processes the whole input array, with each thread processing many inputs.

4. No. As I mentioned earlier in my answer, Matlab uses linear memory to store matrices, not an array of pointers, so it doesn't match the layout your kernel code is expecting. (A hypothetical MEX wrapper is sketched after the example code below.)

Example code:

__global__ void calc(int *A, int *B, int N)
{
    // i is this thread's global index; s is the total number of
    // threads in the grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int s = blockDim.x * gridDim.x;

    // Grid-stride loop: each thread strides through the array, so a
    // modest, fixed-size grid can process any N
    for( ; i<N; i+=s) {
        A[i] += B[i];
    }
}
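Returning to question 4: the kernel above can be reached from MATLAB through a MEX gateway. What follows is a minimal sketch of what that could look like, not part of the original answer: the function name calcmex, the int32 input requirement (MATLAB's default numeric type is double, but this kernel works on int) and the grid/block sizes are all assumptions, and the file would need to be built with a CUDA-aware MEX setup such as mexcuda:

/* Hypothetical MEX gateway; called from MATLAB as, e.g.,
       C = calcmex(int32(A), int32(B));
   MATLAB stores matrices in column-major linear memory, which is already
   the flat layout the kernel above expects, so for an element-by-element
   operation no layout conversion is needed. */
#include "mex.h"
#include <cuda_runtime.h>

__global__ void calc(int *A, int *B, int N);    /* the kernel above */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int N = (int)mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxDuplicateArray(prhs[0]);        /* output starts as a copy of A */

    int *dA, *dB;
    cudaMalloc(&dA, N * sizeof(int));
    cudaMalloc(&dB, N * sizeof(int));
    cudaMemcpy(dA, mxGetData(plhs[0]), N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, mxGetData(prhs[1]), N * sizeof(int), cudaMemcpyHostToDevice);

    calc<<<112, 128>>>(dA, dB, N);              /* illustrative grid/block sizes */

    cudaMemcpy(mxGetData(plhs[0]), dA, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}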
