Strategy for doing final reduction


Problem description

I am trying to implement an OpenCL version of a reduction over an array of floats.

To achieve this, I took the following code snippet found on the web:

__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];
 }                  

This kernel works well, but I would like to compute the final sum by adding together the partial sums of all work-groups. Currently, I do this final step on the CPU with a simple loop of nWorkGroups iterations.
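For reference, the CPU step described above can be sketched in plain C as follows. This is only an illustration: the function name sum_partials and its parameters are hypothetical, and it assumes the partialSums buffer has already been read back to host memory.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): adds up the
   per-work-group results read back from the partialSums buffer.
   n_work_groups is the number of work-groups the kernel ran with. */
double sum_partials(const double *partial_sums, size_t n_work_groups)
{
    double total = 0.0;
    for (size_t i = 0; i < n_work_groups; ++i)
        total += partial_sums[i];
    return total;
}
```

With a large work-group size, the number of partial sums is small compared with the input, so this loop is usually cheap next to the kernel itself.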

I also saw another solution using atomic functions, but it seems to be implemented only for int, not for floats. I think that only CUDA provides atomic functions for float.
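A common workaround when only integer atomics are available is to add floats through a compare-and-swap loop on the value's bit pattern. The sketch below models that idea in plain C11 rather than in OpenCL C; the name atomic_add_float is hypothetical. In an OpenCL 1.2 kernel the same role would be played by atomic_cmpxchg on an unsigned int, and since the kernel above uses double, it would additionally need 64-bit atomics (the cl_khr_int64_base_atomics extension), which is an assumption about the device.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically adds `value` to the float whose bit
   pattern is stored in *target, using only an integer compare-and-swap. */
void atomic_add_float(_Atomic uint32_t *target, float value)
{
    uint32_t old_bits = atomic_load(target);
    uint32_t new_bits;
    float old_val, new_val;
    do {
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bits -> float */
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);  /* float -> bits */
        /* If another thread changed *target in the meantime, the exchange
           fails, old_bits is refreshed with the current contents, and the
           addition is retried. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
}
```

Every work-group adding into a single location this way serializes on that location, so this tends to suit a small number of partial sums rather than a whole input array.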

I also saw that I could write another kernel that performs this sum, but I would like to avoid that solution in order to keep the source simple and readable. Maybe I cannot do without it...

I should mention that I use OpenCL 1.2 (as reported by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think that OpenCL 2.0 is not supported on my card).

More generally, I would like advice on the simplest way to perform this final summation with my graphics card model and OpenCL 1.2.

Any help is welcome. Thanks.

Recommended answer

Sorry for the previous code; it also has a problem.

A barrier with CLK_GLOBAL_MEM_FENCE only affects the current work-group; it does not synchronize different work-groups. I was confused. =[

If you want to compute the final sum on the GPU, you should enqueue a reduction kernel with clEnqueueNDRangeKernel after clFinish(commandQueue), so that it only runs once all partial sums have been written.

Please take the code below only as a concept.

__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];

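    // NOTE: this barrier only synchronizes the work-items of one work-group,
    // so other groups may not have written partialSums yet. The final stage
    // below is therefore conceptual only.
    barrier(CLK_GLOBAL_MEM_FENCE);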

      if(get_group_id(0)==0){
          if(local_id < get_num_groups(0)){  // 16384
            for(int n=0 ; n<get_num_groups(0) ; n+= group_size )
               localSums[local_id] += partialSums[local_id+n];
            barrier(CLK_LOCAL_MEM_FENCE);

            for(int s=group_size/2;s>0;s/=2){
               if(local_id < s)
                  localSums[local_id] += localSums[local_id+s];
               barrier(CLK_LOCAL_MEM_FENCE);
            }
            if(local_id == 0)
               partialSums[0] = localSums[0];
          }
       }
 }
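To make the advice above concrete (run the reduction again in a separate launch instead of trying to synchronize work-groups inside one kernel), here is a small CPU model in plain C. It is a sketch, not OpenCL host code: reduce_pass stands in for one launch of sumGPU, reduce_all stands in for the host loop that re-enqueues it, and both names are hypothetical. The bounds check replaces the padding the real kernel would need, since sumGPU reads input[get_global_id(0)] unconditionally.

```c
#include <stddef.h>

/* Stand-in for one kernel launch: every "work-group" of group_size
   consecutive elements is reduced to a single value in out[group].
   Returns the number of partial sums written. */
static size_t reduce_pass(const double *in, double *out,
                          size_t n, size_t group_size)
{
    size_t n_groups = (n + group_size - 1) / group_size;
    for (size_t g = 0; g < n_groups; ++g) {
        double sum = 0.0;
        for (size_t i = 0; i < group_size; ++i) {
            size_t idx = g * group_size + i;
            if (idx < n)              /* last group may be incomplete */
                sum += in[idx];
        }
        out[g] = sum;
    }
    return n_groups;
}

/* Stand-in for the host loop: repeat the pass, swapping the two buffers,
   until a single value is left. Assumes n >= 1 and group_size >= 2.
   buf_a holds the n inputs and is overwritten; buf_b is scratch space
   with room for at least ceil(n / group_size) values. */
double reduce_all(double *buf_a, double *buf_b, size_t n, size_t group_size)
{
    double *in = buf_a, *out = buf_b;
    while (n > 1) {
        n = reduce_pass(in, out, n, group_size);
        double *tmp = in;             /* this pass's output becomes */
        in = out;                     /* the next pass's input      */
        out = tmp;
    }
    return in[0];
}
```

On the device, each call to reduce_pass would correspond to one clEnqueueNDRangeKernel of the same kernel, with the input and partialSums arguments swapped between launches. Two buffers are used because different work-groups run concurrently, so writing partial sums into the buffer that other groups are still reading would not be safe.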
