CUDA how to get grid, block, thread size and parallelize non-square matrix calculation


Question


I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops. Specifically, how do I set up dimBlock and dimGrid to make this run faster? I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.

__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}


This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize this and learn CUDA at the same time. I have read the guides but still do not understand how to get the proper number of grids/blocks/threads going and use them effectively.

Answer


As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.


The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];


for all k between 0 and 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}


[disclaimer: code written in browser, not tested, use at own risk]


Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);


Then 256 blocks, each containing 512 threads will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size was not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.


All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.
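If a two-dimensional index feels more natural for the 128 x 1024 case, the same operation can also be expressed with a 2D grid. In the same browser-written, untested spirit as the kernel above (the 32 x 8 block shape and the name `mAdd2D` are illustrative choices only):

```cuda
__global__ void mAdd2D(float* A, float* B, float* C, int rows, int cols)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    // Guard both dimensions: partial blocks at the right and bottom
    // edges contain threads with no element to process.
    if (row < rows && col < cols)
        C[row * cols + col] = A[row * cols + col] + B[row * cols + col];
}

// Launch, rounding each grid dimension up independently:
// dim3 block(32, 8);
// dim3 grid((cols + block.x - 1) / block.x,
//           (rows + block.y - 1) / block.y);
// mAdd2D<<<grid, block>>>(A, B, C, 128, 1024);
```

For a pure element-wise operation like this, the flat 1D version is equally correct; the 2D form mainly pays off when the computation itself needs row and column coordinates.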

