CUDA: how to sum all elements of an array into one number within the GPU?


Problem Description

First of all, let me state that I am fully aware that my question has already been asked: Block reduction in CUDA. However, as I hope to make clear, my question is a follow-up to that one, and I have particular needs that make the solution found by that OP unsuitable.

So, let me explain. In my current code, I run a CUDA kernel at every iteration of a while-loop to do some computations over the values of an array. As an example, think of it like the following:

__global__ void calcKernel(int* idata, int* odata, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
    {
        odata[i] = (idata[i] + 2) * 5;
    }
}

int max_iterations = 1000;
int iteration = 0;
while(iteration < max_iterations)
{
    calcKernel<<<num_blocks, block_size>>>(idata, odata, n);
    iteration++;
}

However, next I have to accomplish a seemingly hard task for the GPU. At each iteration of the while-loop that calls the kernel, I have to sum all the values generated within odata and save the result in an int array called result, at the position within that array that corresponds to the current iteration. It has to be accomplished inside the kernel, or at least still on the GPU, because due to performance constraints I can only retrieve the result array at the very end, after all iterations are completed.

A wrong, naïve attempt would look something like the following:

__global__ void calcKernel(int* idata, int* odata, int* result, int iteration, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
    {
        odata[i] = (idata[i] + 2) * 5;
    }

    result[iteration] = 0;
    for(int j = 0; j < n; j++)
    {
        result[iteration] += odata[j];
    }
}

int max_iterations = 1000;
int iteration = 0;
while(iteration < max_iterations)
{
    calcKernel<<<num_blocks, block_size>>>(idata, odata, result, iteration, n);
    iteration++;
}

Of course, the code above does not work, due to the GPU distributing the code across threads. In order to learn how to do this properly, I have been reading other questions on this site about array reduction using CUDA. In particular, I found a mention of a very good NVIDIA pdf on the subject, which is also discussed in the former SO question I mentioned at the beginning: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
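For reference, the kernel those slides build up is along these lines. This is only a minimal sketch of the shared-memory tree reduction (sequential addressing, without the unrolling optimizations the slides add later); the names are mine, not the slides':

```cuda
// Minimal shared-memory tree reduction (sequential addressing).
// Each block reduces blockDim.x elements and writes ONE partial sum,
// so the output array g_odata has gridDim.x elements, not one.
__global__ void reduceSum(const int* g_idata, int* g_odata, int n)
{
    extern __shared__ int sdata[];   // sized at launch: blockDim.x * sizeof(int)

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;   // pad out-of-range threads with 0
    __syncthreads();

    // Halve the number of active threads at each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
```

Note that the output has one element per block, which is exactly why the slides' code produces "a full array" rather than a single number.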

However, while I fully understand the steps of the code described in those slides, as well as the general optimizations, I don't get how that approach can sum-reduce an array to one number if the code actually outputs a full array (and one of unclear dimensions). Could someone please shed some light on this and show me an example of how that would work (i.e., how to get the one number out of the output array)?
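For what it's worth, the usual way to collapse that per-block output down to one number is simply to launch the reduction again over the partial sums until a single element remains. A hedged host-side sketch, assuming a kernel `reduceSum` that writes one partial sum per block, and device buffers `d_in`/`d_out` of `N` ints already allocated:

```cuda
// Hypothetical host loop: reduce repeatedly until one element remains.
// Assumes reduceSum writes gridDim.x partial sums into d_out.
int n = N;
int threads = 256;
while (n > 1) {
    int blocks = (n + threads - 1) / threads;
    reduceSum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    int* tmp = d_in; d_in = d_out; d_out = tmp;   // next pass consumes the partial sums
    n = blocks;
}
// d_in[0] now holds the total sum
```

Since each pass shrinks the problem by a factor of the block size, only a couple of launches are needed in practice.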

Now, going back to the question I mentioned at the beginning (Block reduction in CUDA). Note that its accepted answer merely suggests reading the pdf I linked above, which does not talk about what to do with the output array generated by the code. In the comments, the OP there mentions that he/she was able to finish the job by summing the output array on the CPU, which is something I cannot do, since that would mean downloading the output array at every iteration of my while-loop. Lastly, the third answer in that link suggests using a library to accomplish this, but I am interested in learning the native way of doing so.

Alternatively, I would also be very interested in any other suggestions on how to implement what I described above.

Answer

You have already found the canonical information regarding block-parallel reductions, so I will not repeat that. If you don't want to write a lot of new code yourself to do this, I would suggest looking at the CUB library's block_reduce implementation, which provides an optimal block-wise reduction operation with the addition of about 4 lines of code to your existing kernel.
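To make the "about 4 lines" concrete, here is a hedged sketch of what that could look like with CUB's `cub::BlockReduce`; the template parameter must match the block size used at launch, and the kernel name and per-thread computation are assumed from the question, not prescribed by CUB:

```cuda
#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void calcAndReduce(const int* idata, int* odata,
                              int* iter_result, int iter_num, int n)
{
    // The added lines: CUB's block-wide reduction primitive
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (i < n) ? (idata[i] + 2) * 5 : 0;   // existing per-thread work
    if (i < n) odata[i] = val;

    // Sum across the block; the valid aggregate is returned to thread 0
    int block_sum = BlockReduce(temp_storage).Sum(val);

    if (threadIdx.x == 0)
        atomicAdd(&iter_result[iter_num], block_sum);
}
```

Out-of-range threads contribute 0 so that the block-wide sum is unaffected by padding.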

On the real question here, you can do what you want if you do something like this:

__global__ void kernel(....., int* iter_result, int iter_num) {

    // Your calculations first so that each thread holds its result

    // Block-wise reduction so that one thread in each block holds the sum of the thread results

    // The one thread holding the block result adds it to the global iteration result
    if (threadIdx.x == 0)
        atomicAdd(iter_result + iter_num, block_result);
}

The key here is that an atomic function is used to safely update the per-iteration result with the result from a given block, without a memory race. You absolutely must initialise iter_result before running the kernel, otherwise the code won't work, but that is the basic kernel design pattern.
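Putting the pieces together, a hedged end-to-end sketch of that pattern with a hand-written block reduction (this assumes blockDim.x is a power of two and that the caller passes `blockDim.x * sizeof(int)` of dynamic shared memory at launch; the kernel name and computation are taken from the question's example):

```cuda
__global__ void calcKernel(const int* idata, int* odata,
                           int* iter_result, int iter_num, int n)
{
    extern __shared__ int sdata[];   // blockDim.x ints, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Per-thread calculation
    int val = (i < n) ? (idata[i] + 2) * 5 : 0;
    if (i < n) odata[i] = val;

    // 2. Block-wise tree reduction in shared memory
    sdata[threadIdx.x] = val;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // 3. One thread per block accumulates into this iteration's slot
    if (threadIdx.x == 0)
        atomicAdd(&iter_result[iter_num], sdata[0]);
}
```

A `cudaMemset(iter_result, 0, max_iterations * sizeof(int))` before the while-loop performs the required initialisation; the kernel is then launched once per iteration with the current iteration number as `iter_num`.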

