OpenCL/CUDA:两阶段归约算法 [英] OpenCL/CUDA: Two-stage reduction Algorithm
问题描述
可以通过调用__reduce()来减少大型数组.多次.
Reduction of large arrays can be done by calling __reduce(); multiple times.
The following code however uses only two stages and is documented here:
但是,我无法理解此两阶段归约的算法.可以给一个简单的解释吗?
However I am unable to understand the algorithm for this two stage reduction. can some give a simpler explanation?
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
使用CUDA也可以很好地实现它.
It can also be well implemented using CUDA.
推荐答案
您创建N
线程.第一个线程查看位置0,N,2 * N,...处的值.第二个线程查看值1,N + 1、2 * N + 1,....这是第一个循环.它将length
值减少为N个值.
You create N
threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduces length
values into N values.
然后,每个线程将其最小值保存在共享/本地内存中.然后,您将获得一条同步指令(barrier(CLK_LOCAL_MEM_FENCE)
.)然后,您将具有标准的共享/本地内存缩减功能.完成后,本地ID为0的线程将其结果保存在输出数组中.
Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE)
.) Then you have standard reduction in shared/local memory. When you're done the thread with local id 0 saves its result in the output array.
总而言之,您将值从length
减少到了N/get_local_size(0)
.该代码执行完后,您需要做最后一遍.但是,这可以完成大部分工作,例如,您的长度可能为〜10 ^ 8,N = 2 ^ 16,get_local_size(0)= 256 = 2 ^ 8,并且此代码将10 ^ 8的元素减少为256个元素
All in all, you have a reduction from length
to N/get_local_size(0)
values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.
您不了解哪些部分?
这篇关于OpenCL/CUDA:两阶段归约算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!