OpenCL - using atomic reduction for double

Question
I know atomic functions in OpenCL 1.x are not recommended, but I just want to understand an atomic example.
The following kernel code is not working correctly; it produces random final values when computing the sum of all array values (sum reduction):
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(volatile __local double *val, double delta)
{
    union {
        double f;
        ulong  i;
    } old, new;

    do
    {
        old.f = *val;
        new.f = old.f + delta;
    }
    while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);
}
__kernel void sumGPU ( __global const double *input,
                       __local double *localInput,
                       __global double *finalSum
                     )
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);
    uint localSize = get_local_size(0);
    uint groupid = get_group_id(0);

    local double partialSum;
    local double finalSumTemp;

    // Initialize sums
    if (lid==0)
    {
        partialSum = 0.0;
        finalSumTemp = 0.0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Set in local memory
    int idx = groupid * localSize + lid;
    localInput[lid] = input[idx];

    // Compute atom_add into each workGroup
    barrier(CLK_LOCAL_MEM_FENCE);
    atom_add_double(&partialSum, localInput[lid]);

    // See and Check if barrier below is necessary
    barrier(CLK_LOCAL_MEM_FENCE);

    // Final sum of partialSums
    if (lid==0)
    {
        atom_add_double(&finalSumTemp, partialSum);
        *finalSum = finalSumTemp;
    }
}
The version using the global id strategy works well, but the version above, which goes through local memory (shared memory), does not give the expected results (the value of *finalSum is random on each execution).
Here are the buffers and kernel arguments that I set in my host code:
// Write to buffers
ret = clEnqueueWriteBuffer(command_queue, inputBuffer, CL_TRUE, 0,
nWorkItems * sizeof(double), xInput, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
sizeof(double), finalSumGPU, 0, NULL, NULL);
// Set the arguments of the kernel
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&finalSumBuffer);
Finally, I read finalSumBuffer to get the sum value.
I think my issue comes from the kernel code, but I can't find where the error is. If anyone can see what's wrong, please let me know.
Thanks
UPDATE 1:

I nearly managed to perform this reduction. Following the suggestions of huseyin tugrul buyukisik, I have modified the kernel code like this:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(volatile __local double *val, double delta)
{
    union {
        double d;
        ulong  i;
    } old, new;

    do
    {
        old.d = *val;
        new.d = old.d + delta;
    }
    while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);
}
__kernel void sumGPU ( __global const double *input,
                       __local double *localInput,
                       __local double *partialSum,
                       __global double *finalSum
                     )
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);
    uint localSize = get_local_size(0);
    uint groupid = get_group_id(0);

    // Initialize partial sums
    if (lid==0)
        partialSum[groupid] = 0.0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Set in local memory
    int idx = groupid * localSize + lid;
    localInput[lid] = input[idx];

    // Compute atom_add into each workGroup
    barrier(CLK_LOCAL_MEM_FENCE);
    atom_add_double(&partialSum[groupid], localInput[lid]);

    // See and Check if barrier below is necessary
    barrier(CLK_LOCAL_MEM_FENCE);

    // Compute final sum
    if (lid==0)
        *finalSum += partialSum[groupid];
}
As huseyin said, I don't need atomic functions for the final sum of all partial sums, so I simply did:
// Compute final sum
if (lid==0)
    *finalSum += partialSum[groupid];
But unfortunately, the final sum doesn't give the expected value, and the value is random (for example, with nwork-items = 1024 and size-WorkGroup = 16, I get random values on the order of [1e+3 - 1e+4] instead of the expected 5.248e+05).
Here is the argument setup in the host code:
// Set the arguments of the kernel
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
clSetKernelArg(kernel, 2, nWorkGroups*sizeof(double), NULL);
clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&finalSumBuffer);
Can you see where my error is in the kernel code?

Thanks
Answer
Not an error, but a logic issue:
atom_add_double(&finalSumTemp, partialSum);
runs only once per group (executed by the thread with local id 0). So you are effectively just doing

finalSumTemp = partialSum;

and atomics are not needed here.
There is a race condition on

*finalSum = finalSumTemp;

between workgroups, where each zero-indexed local thread writes to the same address. So this should be an atomic addition (for learning purposes), or the partial sums could be written to different cells and added on the host side, such as sum_group1 + sum_group2 + ... = total sum.
int idx = groupid * localSize + lid;
localInput[lid] = input[idx];
Here, using groupid is suspicious for multi-device summation, because each device has its own global range and workgroup id indexing, so two devices could have the same group id values for two different groups. A device-related offset should be used when multiple devices are involved, such as:
idx= get_global_id(0) + deviceOffset[deviceId];
Also, if an atomic operation is unavoidable and is executed exactly N times, it could be moved to a single thread (such as the 0-indexed thread) and looped N times in a second kernel (probably faster), unless that atomic operation's latency can be hidden by other means.