What is the optimum OpenCL 2 kernel to sum floats?

Question

C++17 introduced a number of new algorithms to support parallel execution. In particular, std::reduce is a parallel version of std::accumulate which permits non-deterministic behaviour for non-associative operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.
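To make the non-determinism concrete, here is a minimal C++ sketch that sums the same floats in two different groupings. The helper names `sum_sequential` and `sum_pairwise` are hypothetical, and the pairwise order is only one of the groupings a parallel reduce might choose, not what any particular implementation does.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum strictly left to right, the order std::accumulate guarantees.
float sum_sequential(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Sum the half-open range [lo, hi) in pairwise (tree) order.
// This is one possible grouping a parallel reduction could use.
float sum_pairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo == 0) return 0.0f;
    if (hi - lo == 1) return v[lo];
    const std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(v, lo, mid) + sum_pairwise(v, mid, hi);
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f} on an IEEE 754 single-precision platform, the two groupings should give different totals, because 1e8f + 1.0f rounds back to 1e8f. This is why a parallel float sum cannot promise the same result as a sequential one.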

Intel have an example which uses OpenCL 2 work-group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan example:

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.

OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:

kernel void sum_int (global int* sum, global int* values)
{
  atomic_int sum_val;
  atomic_init(&sum_val, 0);

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    int value = work_group_reduce_add(values[index]);
    atomic_fetch_add(&sum_val, value);
  }

  sum[0] = atomic_load(&sum_val);
}

OpenCL 2 also provides an atomic_float but it doesn't support atomic_fetch_add.

What is the best way to implement an OpenCL 2 kernel to sum floats?

Recommended answer

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

This has a race condition when writing to sum's zero-indexed element, and every work-group performs the same computation, which makes it O(N*N) instead of O(N). It takes more than 1100 milliseconds to sum a 1M-element array.

For the same 1M-element array, this (global=1M, local=256)

kernel void sum_float2 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
  if(get_local_id(0)==0)
    sum[get_group_id(0)] = sum_partial;
}

followed by this (global=4k, local=256)

kernel void sum_float3 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
  if(get_local_id(0)==0)
    values[get_group_id(0)] = sum_partial; 
}

does the same thing in a few milliseconds, apart from a third step. The first kernel writes each group's sum to the element indexed by its group id, and the second kernel reduces those sums to 16 values, which can easily be summed by the CPU (in microseconds or less) as the third step.

The program works like this:

values: 1.0 1.0 .... 1.0 1.0 
sum_float2
sum: 256.0 256.0 256.0
sum_float3
values: 65536.0 65536.0 .... 16 items total to be summed by cpu 
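The trace above can be checked on the host with a small C++ sketch that stands in for the two kernel launches. This is a sequential simulation only, not OpenCL code: `reduce_groups` and `sum_two_pass` are hypothetical names, and each "work-group" is modelled as a run of `group_size` consecutive elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for one kernel launch: every run of group_size
// consecutive elements is reduced to one partial sum, which is the role
// work_group_reduce_add plays in sum_float2 and sum_float3.
std::vector<float> reduce_groups(const std::vector<float>& in, std::size_t group_size) {
    assert(group_size > 0 && in.size() % group_size == 0);
    std::vector<float> out(in.size() / group_size, 0.0f);
    for (std::size_t g = 0; g < out.size(); ++g)
        for (std::size_t l = 0; l < group_size; ++l)
            out[g] += in[g * group_size + l];
    return out;
}

// The three steps from the answer: two group-reduction passes,
// then a final sum of the few remaining partials on the CPU.
float sum_two_pass(const std::vector<float>& values, std::size_t group_size) {
    const std::vector<float> pass1 = reduce_groups(values, group_size);
    const std::vector<float> pass2 = reduce_groups(pass1, group_size);
    float total = 0.0f;
    for (float p : pass2) total += p;
    return total;
}
```

For 1M ones and a group size of 256, the first pass leaves 4096 partial sums and the second leaves 16, matching the trace. Unlike the kernels, which reuse the sum and values buffers in place, this sketch returns a fresh vector from each pass for clarity.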

If you need to use atomics, you should use them as sparingly as possible. The easiest example is to use local atomics to sum many values within each group, and then do the last step with a single global atomic operation per group to add everything up. I don't have a C++ setup ready for OpenCL right now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices that share the same memory resource (probably in streaming mode or in SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on the same area at the same time, then I suppose these new atomics can only be a micro-optimization on top of already working OpenCL 1.2 atomics. I haven't used these new atomics, so take all of this with a grain of salt.
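The question notes that atomic_float has no atomic_fetch_add. A common workaround for a missing atomic add is a compare-and-swap retry loop. The sketch below shows the pattern in host-side C++ with std::atomic<float>; the function name `atomic_fetch_add_float` is hypothetical. Whether the corresponding compare-exchange on atomic_float is available and fast enough on a given OpenCL 2 device is an assumption you would need to verify against the specification and your driver.

```cpp
#include <atomic>
#include <cassert>

// Adds `value` to `target` atomically using a compare-and-swap retry loop.
// Returns the value held before the addition, as a fetch_add would.
float atomic_fetch_add_float(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    // When the exchange fails, compare_exchange_weak stores the current
    // value into `expected`, so each retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

In line with the advice above, a loop like this is best called once per work-group on an already reduced partial sum rather than once per element, since every failed exchange costs a retry under contention.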
