OpenCL/CUDA:两阶段归约算法 [英] OpenCL/CUDA: Two-stage reduction Algorithm

查看:142
本文介绍了OpenCL/CUDA:两阶段归约算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

可以通过调用__reduce()来减少大型数组.多次.

Reduction of large arrays can be done by calling __reduce(); multiple times.

但是,以下代码仅使用两个阶段,并记录在

The following code however uses only two stages and is documented here:

但是,我无法理解此两阶段归约的算法.可以给一个简单的解释吗?

However I am unable to understand the algorithm for this two stage reduction. can some give a simpler explanation?

__kernel
void reduce(__global float* buffer,
        __local float* scratch,
        __const int length,
        __global float* result) {

    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Perform parallel reduction
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}

使用CUDA也可以很好地实现它.

It can also be well implemented using CUDA.

推荐答案

您创建N线程.第一个线程查看位置0,N,2 * N,...处的值.第二个线程查看值1,N + 1、2 * N + 1,....这是第一个循环.它将length值减少为N个值.

You create N threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduces length values into N values.

然后,每个线程将其最小值保存在共享/本地内存中.然后,您将获得一条同步指令(barrier(CLK_LOCAL_MEM_FENCE).)然后,您将具有标准的共享/本地内存缩减功能.完成后,本地ID为0的线程将其结果保存在输出数组中.

Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE).) Then you have standard reduction in shared/local memory. When you're done the thread with local id 0 saves its result in the output array.

总而言之,您将值从length减少到了N/get_local_size(0).该代码执行完后,您需要做最后一遍.但是,这可以完成大部分工作,例如,您的长度可能为〜10 ^ 8,N = 2 ^ 16,get_local_size(0)= 256 = 2 ^ 8,并且此代码将10 ^ 8的元素减少为256个元素

All in all, you have a reduction from length to N/get_local_size(0) values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.

您不了解哪些部分?

这篇关于OpenCL/CUDA:两阶段归约算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆