OpenCL - using atomic reduction for double


Question

I know atomic functions with OpenCL-1.x are not recommended, but I just want to understand an atomic example.

The following kernel code is not working well; it produces random final values for the computation of the sum of all array values (sum reduction):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// OpenCL 1.x has no atomic add for double, so it is emulated here with a
// compare-and-swap loop on the 64-bit integer representation of the value.
void atom_add_double(volatile __local double *val, double delta)
{
  union {
    double f;
    ulong  i;
  } old, new;

  do
  {
    old.f = *val;
    new.f = old.f + delta;
  }
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);
}

__kernel void sumGPU ( __global const double *input, 
               __local double *localInput,
               __global double *finalSum
                 )
{

  uint lid = get_local_id(0); 
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);
  local double partialSum;
  local double finalSumTemp;

 // Initialize sums
  if (lid==0)
  {
   partialSum = 0.0;
   finalSumTemp = 0.0;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workGroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum, localInput[lid]);
  // See and Check if barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Final sum of partialSums
  if (lid==0)
  {
   atom_add_double(&finalSumTemp, partialSum);
   *finalSum = finalSumTemp;
  }

}                   

The version with the global id strategy works well, but the version above, which goes through local memory (shared memory), doesn't give the expected results (the value of *finalSum is random on each execution).

Here are the buffers and kernel args that I have put in my host code:

 // Write to buffers
  ret = clEnqueueWriteBuffer(command_queue, inputBuffer, CL_TRUE, 0,
        nWorkItems * sizeof(double), xInput, 0, NULL, NULL);
  ret = clEnqueueWriteBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
                      sizeof(double), finalSumGPU, 0, NULL, NULL);

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&finalSumBuffer);
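
For context, the kernel launch is not shown in the post; it would presumably look something like this (a sketch reusing the nWorkItems and local_item_size variables from above):

 // Launch one work-item per input element (assumes nWorkItems
 // is a multiple of local_item_size)
  size_t global_item_size = nWorkItems;
  size_t local_size = local_item_size;
  ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
        &global_item_size, &local_size, 0, NULL, NULL);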

And finally, I read finalSumBuffer to get the sum value.
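
The read-back itself is not shown either; a blocking read of the single double result would presumably be:

  ret = clEnqueueReadBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
                      sizeof(double), finalSumGPU, 0, NULL, NULL);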

I think my issue comes from the kernel code, but I can't find where the error is.

If anyone can see what's wrong, please let me know.

Thanks

Update 1:

I nearly managed to perform this reduction. Following the suggestions proposed by huseyin tugrul buyukisik, I have modified the kernel code like this:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(volatile __local double *val, double delta)
{
  union {
  double d;
  ulong  i;
  } old, new;

  do
  {
   old.d = *val;
   new.d = old.d + delta;
  } 
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);

}  

__kernel void sumGPU ( __global const double *input, 
               __local double *localInput,
               __local double *partialSum,
               __global double *finalSum
                 )
{

  uint lid = get_local_id(0); 
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);

  // Initialize partial sums
  if (lid==0)
    partialSum[groupid] = 0.0; 


  barrier(CLK_LOCAL_MEM_FENCE);
  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workGroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum[groupid], localInput[lid]);
  // See and Check if barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Compute final sum
  if (lid==0)
    *finalSum += partialSum[groupid]; 

}                   

As huseyin said, I don't need to use atomic functions for the final sum of all the partial sums.

So at the end I did:

// Compute final sum
  if (lid==0)
    *finalSum += partialSum[groupid]; 

But unfortunately, the final sum doesn't give the expected value, and the value is random (for example, with nwork-items = 1024 and size-WorkGroup = 16, I get random values on the order of [1e+3 - 1e+4] instead of the expected 5.248e+05).
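
(For reference, 5.248e+05 is consistent with an input holding the integers 1 through 1024, since 1024 * 1025 / 2 = 524800; the exact input values are not shown in the post.)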

Here is the setting of the arguments in the host code:

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, nWorkGroups*sizeof(double), NULL);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&finalSumBuffer);

Can you see where the error is in my kernel code?

Thanks

Answer

Not an error but a logic issue:

atom_add_double(&finalSumTemp, partialSum);

is executed only once per group (by the zero-local-indexed thread).

So you are effectively just doing

finalSumTemp = partialSum

and atomics are not needed here.

There is a race condition on

*finalSum = finalSumTemp;

between workgroups, where each zero-indexed local thread writes to the same address. So this should either be an atomic addition (for learning purposes), or the per-group results could be written to different cells and added on the host side, such that sum_group1 + sum_group2 + ... = total sum.
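
A minimal sketch of the second option (not from the original answer; partialSums is a hypothetical __global output buffer with one cell per workgroup, replacing the finalSum argument):

__kernel void sumGPU ( __global const double *input, 
               __local double *localInput,
               __global double *partialSums   // one cell per workgroup
                 )
{
  uint lid = get_local_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);
  local double partialSum;

  if (lid==0)
    partialSum = 0.0;
  barrier(CLK_LOCAL_MEM_FENCE);

  localInput[lid] = input[groupid * localSize + lid];
  barrier(CLK_LOCAL_MEM_FENCE);

  atom_add_double(&partialSum, localInput[lid]);
  barrier(CLK_LOCAL_MEM_FENCE);

  // Each group writes its own cell, so there is no cross-group race.
  if (lid==0)
    partialSums[groupid] = partialSum;
}

The host then reads nWorkGroups doubles back from partialSums and adds them up, or a second kernel does (see the sketch at the end of this answer).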

int idx = groupid * localSize + lid;
localInput[lid] = input[idx];

Here, using groupid is suspicious for multi-device summation, because each device has its own global range and workgroup id indexing, so two devices could have the same group id values for two different groups. Some device-related offset should be used when multiple devices are involved, such as:

idx = get_global_id(0) + deviceOffset[deviceId];


Also, if an atomic operation is unavoidable and is executed exactly N times, it could be moved to a single thread (such as the 0-indexed thread) and looped N times in a second kernel (probably faster), unless the atomic operation's latency can be hidden by other means.
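
For illustration, such a second kernel could look like this (a sketch; the kernel name and the n argument are hypothetical):

__kernel void finalSumKernel ( __global const double *partialSums,
               __global double *finalSum,
               uint n )
{
  // A single work-item loops over all n partial sums serially,
  // so no atomics are needed at all.
  if (get_global_id(0) == 0)
  {
    double sum = 0.0;
    for (uint i = 0; i < n; ++i)
      sum += partialSums[i];
    *finalSum = sum;
  }
}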

