OpenCL float sum reduction

Views: 119 · Published: 2020/5/13 21:36:29 · Tags: multithreading, parallel-processing, opencl, race-condition, reduction


Question

I would like to apply a reduction to this piece of my kernel code (one-dimensional data):

__local float sum = 0;
int i;
for(i = 0; i < length; i++)
  sum += /* some operation depending on i here */;

Instead of having a single thread perform this operation, I would like to have n threads (with n = length) and, at the end, have one thread compute the total sum.

In pseudocode, I would like to be able to write something like this:

int i = get_global_id(0);
__local float sum = 0;
sum += /* some operation depending on i here */;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
  res = sum;

Is there a way?

I have a race condition on sum.

Recommended answer

To get you started, you could do something like the example below (see Scarpino). It also takes advantage of vector processing by using the OpenCL float4 data type.

Keep in mind that the kernel below returns a number of partial sums to the host: one for each local work-group. This means that you will have to carry out the final sum on the host by adding up all the partial sums. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
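As a rough illustration of that final host-side step, here is a minimal C sketch. It assumes the partial sums have already been copied back into a host array (for example with clEnqueueReadBuffer); the function name sum_partial_sums and its parameters are hypothetical and not part of the original answer.

```c
#include <stddef.h>

/* Hypothetical helper: adds up the per-work-group partial sums on the host.
 * `partial` is assumed to hold one float per work-group, already copied
 * back from the device. */
float sum_partial_sums(const float *partial, size_t num_groups)
{
    /* Accumulate in double to limit rounding error when there are many groups. */
    double total = 0.0;
    for (size_t g = 0; g < num_groups; g++) {
        total += partial[g];
    }
    return (float)total;
}
```

Here num_groups would be the global size divided by the local size used when the kernel was enqueued.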

If summing the partial sums on the host is undesirable, you can get around this by launching the kernel multiple times. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below, you will need to modify your host code to call the kernel repeatedly, and add logic to stop executing the kernel once the number of output vectors falls below the local size (details are left to you, or check the Scarpino reference).
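The stopping logic for such a multi-pass scheme might look like the following C sketch. It only tracks how the element count shrinks from pass to pass; the actual kernel launch is left as a comment. The function name count_reduction_passes is hypothetical, and the sketch assumes the element count is a multiple of the local size on every pass and that each pass emits one output element per work-group.

```c
#include <stddef.h>

/* Hypothetical helper: counts how many kernel launches are needed before
 * the number of remaining elements drops below the local size.
 * `*remaining` receives the number of elements left for the final step. */
size_t count_reduction_passes(size_t n, size_t local_size, size_t *remaining)
{
    size_t passes = 0;
    /* A local size below 2 would never shrink the data, so skip the loop. */
    while (local_size > 1 && n >= local_size) {
        /* Here the host would enqueue the reduction kernel with
         * global size n and local size local_size. */
        n /= local_size;   /* one output element per work-group */
        passes++;
    }
    *remaining = n;
    return passes;
}
```

The elements that remain after the last pass still have to be added together, either on the host or in one last small launch.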

Added an extra kernel argument for the output. Added a dot product to sum over the components of the float4 vector.

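__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);

    // Each work-item copies one float4 from global memory into local memory.
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work-items on each pass.
    // This assumes the work-group size is a power of two.
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 adds the four vector components and writes the group's partial sum.
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}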
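To see what the reduction loop in the kernel does, it can help to emulate one work-group serially on the CPU. The C sketch below is only an illustration: reduce_group is a hypothetical name, it uses scalar floats instead of float4, and the inner loop stands in for the work-items that run in parallel between barriers. It assumes group_size is a power of two and at least 1.

```c
#include <stddef.h>

/* Serial emulation of the kernel's reduction loop for one work-group.
 * The array is reduced in place; the total ends up in partial_sums[0]. */
float reduce_group(float *partial_sums, size_t group_size)
{
    for (size_t i = group_size / 2; i > 0; i >>= 1) {
        /* On the device, these iterations run in parallel, one per work-item. */
        for (size_t lid = 0; lid < i; lid++) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        /* The barrier in the kernel corresponds to finishing this inner loop. */
    }
    return partial_sums[0];
}
```

With eight elements, the active range shrinks from four work-items to two to one, so the sum is complete after three passes.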
