Troubles with slow speeds in OpenCL

Problem description

I am trying to use OpenCL for the first time. The goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.

I seem to get worse performance with this code than when I simply run it on the CPU with an outer for loop. Any help would be appreciated.

Here is the code:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

int argmin(global double *array, int end)
{
  double minimum = array[0];
  int index = 0;  // start at 0 so the result is defined when array[0] is the minimum
  for (int j = 0; j < end; j++)
  {
    if (array[j] < minimum)
    {
      minimum = array[j];
      index = j;
    }
  }
  return index;
}

kernel void execute(global double *dist, global long *res, global double *min_dist)
{
  int row_size = 0;
  int i = get_global_id(0);

  int row_index = i * row_size;
  res[i] = argmin(&dist[row_index], row_size);
  min_dist[i] = dist[res[i] + row_index];

}

Recommended answer

The commenters make some valid points, but I'll try to be a little more constructive and organised:

  1. Your data appears to consist of double precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with doubles, often only achieving 1/32 or 1/16 the throughput compared to single-precision float operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.
  2. I assume your row_size is not actually 0. It would help to know what the true (typical) value is, and whether it is fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size is very small, the fact that you are running a serial for loop over it could be holding your code back.
  3. How big is your work size? In other words, how many rows are in your array (give a typical range if it varies)? If it is very small, you will see little benefit from GPU parallelism: GPUs have a large number of processors and can schedule a few threads per processor. So your work items will need to number in the hundreds, or better, thousands, to achieve decent hardware utilisation.
  4. You are reading a very large array from (presumably) system memory and not performing any intensive operations on it. This means your bottleneck will typically be on the memory access side - for discrete GPUs, system memory access needs to go through PCIe, so the speed of that link will place an upper bound on your performance. Additionally, your memory access pattern is far from ideal for GPUs - you typically want work items to read adjacent memory cells at the same time, as the memory unit typically fetches 64 bytes or more at once.
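
To make the access-pattern point in (4) concrete: in the kernel from the question, work item i starts reading at dist[i * row_size], so at every loop iteration neighbouring work items touch addresses that are row_size * 8 bytes apart. One possible fix, assuming you are free to store the matrix transposed, is sketched below. The kernel name and the num_rows and row_size arguments are hypothetical additions for illustration, not part of the original code, and NaN handling is not considered.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Assumption: the host stores the matrix transposed (column-major), so
// element (row, col) lives at dist_t[col * num_rows + row].
// num_rows and row_size are hypothetical kernel arguments.
kernel void argmin_rows_transposed(global const double *dist_t,
                                   global long *res,
                                   global double *min_dist,
                                   int num_rows,
                                   int row_size)
{
  int row = get_global_id(0);
  if (row >= num_rows || row_size <= 0)
    return;  // guard against a padded global size and empty rows

  double minimum = dist_t[row];  // column 0 of this row
  int index = 0;
  for (int col = 1; col < row_size; col++)
  {
    // In each iteration, work items row, row + 1, ... read consecutive
    // addresses, so one wide memory fetch can serve several of them.
    double value = dist_t[(size_t)col * num_rows + row];
    if (value < minimum)
    {
      minimum = value;
      index = col;
    }
  }
  res[row] = index;
  min_dist[row] = minimum;
}
```

This still uses one work item per row, so it only helps if there are enough rows to keep the GPU busy, as discussed in (3). Whether it is actually faster on a given device is something only profiling can confirm.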

Suggestions for improvement:

  • Profiling. If at all possible, use your GPU vendor's profiling tools to determine your true bottlenecks. Otherwise we're just guessing.
  • For (4) - if at all possible, try not to move large amounts of data around too much. If you can generate your input arrays on the GPU, do so, so they never leave VRAM.
  • For (4) - Optimise your memory accesses. AMD, NVIDIA and Intel all have OpenCL GPU optimisation guides which explain how to do this. Essentially, re-structure your data layout or your kernel such that adjacent work items read adjacent pieces of memory. You ideally want work item 0 to read array item 0, work item 1 to read array item 1, etc. You may need to use local memory to coordinate between work items. Another option is to read vector-sized chunks of data per work item (e.g. each work item reads a double8 at a time). Watch for alignment in this case though.
  • For (2) & (3) - Unless row_size is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.
  • For (1): If you've optimised everything else and profiling is telling you that comparing doubles on consumer hardware is too slow, either check if you can generate the data as floats without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double as a long and manually unpacking and comparing the exponent and mantissa using integer operations.
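
As a rough sketch of the suggestion for (2) & (3), the kernel below assigns one work-group to each row and combines the partial results with a reduction in local memory. It is one possible variant, not the only way to do it: because each row belongs to a single work-group, it needs no global atomics. The kernel name, the scratch buffers and the row_size argument are hypothetical. It assumes row_size >= 1, a power-of-two work-group size, and a global size of (number of rows) * (work-group size). NaN handling is not considered.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Hypothetical kernel: one work-group per row.
// scratch_val and scratch_idx each need one element per work item in the group.
kernel void argmin_rows_reduce(global const double *dist,
                               global long *res,
                               global double *min_dist,
                               int row_size,
                               local double *scratch_val,
                               local int *scratch_idx)
{
  int row = get_group_id(0);
  int lid = get_local_id(0);
  int lsize = get_local_size(0);
  global const double *row_ptr = dist + (size_t)row * row_size;

  // Phase 1: each work item scans a strided slice of the row. Adjacent
  // work items read adjacent elements, which suits the memory unit.
  double best = INFINITY;
  int best_idx = INT_MAX;  // marks "no element seen" for idle work items
  if (lid < row_size)
  {
    best = row_ptr[lid];
    best_idx = lid;
  }
  for (int j = lid + lsize; j < row_size; j += lsize)
  {
    double value = row_ptr[j];
    if (value < best)
    {
      best = value;
      best_idx = j;
    }
  }
  scratch_val[lid] = best;
  scratch_idx[lid] = best_idx;
  barrier(CLK_LOCAL_MEM_FENCE);

  // Phase 2: tree reduction in local memory. Ties go to the lower index,
  // which matches what a serial left-to-right scan would return.
  for (int stride = lsize / 2; stride > 0; stride /= 2)
  {
    if (lid < stride)
    {
      double other = scratch_val[lid + stride];
      int other_idx = scratch_idx[lid + stride];
      if (other < scratch_val[lid] ||
          (other == scratch_val[lid] && other_idx < scratch_idx[lid]))
      {
        scratch_val[lid] = other;
        scratch_idx[lid] = other_idx;
      }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // Work item 0 writes the result for this row.
  if (lid == 0)
  {
    res[row] = scratch_idx[0];
    min_dist[row] = scratch_val[0];
  }
}
```

On the host side, the two local buffers are sized by calling clSetKernelArg with a NULL value pointer and the byte size, for example local_size * sizeof(cl_double) for scratch_val and local_size * sizeof(cl_int) for scratch_idx. A suitable work-group size depends on the device and on row_size, so treat any particular choice as something to measure rather than assume.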
