What is the algorithm to determine the optimal work group size and number of work groups?


Question

The OpenCL standard defines the following queries for getting information about a device and a compiled kernel:

  • CL_DEVICE_MAX_COMPUTE_UNITS

  • CL_DEVICE_MAX_WORK_GROUP_SIZE

  • CL_KERNEL_WORK_GROUP_SIZE

  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Given these values, how can I calculate the optimal work group size and the number of work groups?

Recommended answer

You discover these values experimentally for your algorithm. Use a profiler to get hard numbers.

I like to use CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups, because I often rely on synchronizing work items. I usually run kernels with little branching, so they take the same time to execute in each compute unit.
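As a rough illustration of the "one work group per compute unit" choice, the sketch below derives a one-dimensional launch geometry from values that are assumed to have been queried already. The `launch_plan` struct and `plan_launch` function are hypothetical names made up for this example, not part of OpenCL, and the strided-loop count assumes the kernel walks the data in steps of the global size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: launch geometry for a 1-D NDRange. */
typedef struct {
    size_t num_groups;        /* one work group per compute unit */
    size_t local_size;        /* work items per work group */
    size_t global_size;       /* num_groups * local_size */
    size_t elements_per_item; /* iterations of the strided loop per work item */
} launch_plan;

/* compute_units: value of CL_DEVICE_MAX_COMPUTE_UNITS (queried elsewhere).
 * local_size:    the work group size chosen for this kernel.
 * n:             number of data elements to process. */
static launch_plan plan_launch(size_t compute_units, size_t local_size, size_t n)
{
    assert(compute_units > 0 && local_size > 0);

    launch_plan p;
    p.num_groups = compute_units;
    p.local_size = local_size;
    p.global_size = compute_units * local_size;
    /* Ceiling division: with fewer work items than elements,
     * each work item handles several elements. */
    p.elements_per_item = (n + p.global_size - 1) / p.global_size;
    return p;
}
```

The resulting `global_size` and `local_size` would be passed as the global and local work sizes to clEnqueueNDRangeKernel, and the kernel would loop from get_global_id(0) up to n in steps of get_global_size(0).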

Some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be optimal for your device. What that multiple actually is depends on your memory access pattern and the type of work you are doing with each work item. Use 1 as the multiple when you are running a heavy, compute-bound (ALU) kernel. Try a larger multiple to hide memory latency if you are bottlenecked by memory access. Use a profiler to determine when your access time and your ALU time are optimal.
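One way to run that experiment is to time the kernel at every multiple that fits and keep the fastest. The sketch below is only an outline under assumptions: `pick_local_size` and the `time_kernel_fn` callback are hypothetical names, the callback stands for whatever measurement you use (a profiler run or event timestamps), and `fake_time` is a made-up stand-in so the example compiles without a device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs the kernel with the given work group size
 * and returns the measured time in milliseconds. */
typedef double (*time_kernel_fn)(size_t local_size, void *user);

/* preferred_multiple: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
 * kernel_max_wg_size: CL_KERNEL_WORK_GROUP_SIZE.
 * Returns the fastest multiple, or 0 if no multiple fits. */
static size_t pick_local_size(size_t preferred_multiple,
                              size_t kernel_max_wg_size,
                              time_kernel_fn time_kernel, void *user)
{
    assert(preferred_multiple > 0 && time_kernel != NULL);

    size_t best = 0;
    double best_ms = 0.0;
    for (size_t size = preferred_multiple; size <= kernel_max_wg_size;
         size += preferred_multiple) {
        double ms = time_kernel(size, user);
        if (best == 0 || ms < best_ms) {
            best = size;
            best_ms = ms;
        }
    }
    return best;
}

/* Stand-in for a real measurement: pretends 128 is the sweet spot. */
static double fake_time(size_t local_size, void *user)
{
    (void)user;
    double d = (double)local_size - 128.0;
    return (d < 0.0 ? -d : d) + 1.0;
}
```

With a preferred multiple of 64 and a kernel limit of 256, the loop tries 64, 128, 192 and 256. In real use the callback would enqueue the kernel and return a measured time instead of the fake model.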

The optimal ALU-to-fetch ratio is 1:1 for any device. This is rarely achieved in practice, so you want to keep the ALU/SIMD banks saturated. This means ALU:fetch should be greater than 1 whenever possible. A ratio below 1 means you should try a larger work group size to better hide the memory latency.
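The rule of thumb about the ratio can be written as a single tuning step. This is a minimal sketch, assuming the ALU:fetch ratio has already been read from a profiler; `next_local_size` is a hypothetical helper, and stepping by exactly one preferred multiple is an assumption of this example rather than something the answer prescribes.

```c
#include <assert.h>
#include <stddef.h>

/* alu_fetch_ratio:    ALU:fetch ratio reported by the profiler.
 * current:            work group size used for that profiling run.
 * preferred_multiple: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
 * kernel_max_wg_size: CL_KERNEL_WORK_GROUP_SIZE.
 * Returns the work group size to try next. */
static size_t next_local_size(double alu_fetch_ratio, size_t current,
                              size_t preferred_multiple,
                              size_t kernel_max_wg_size)
{
    assert(preferred_multiple > 0);

    /* Ratio below 1: the kernel waits on memory, so try a larger
     * work group if one still fits within the kernel's limit. */
    if (alu_fetch_ratio < 1.0 &&
        current + preferred_multiple <= kernel_max_wg_size) {
        return current + preferred_multiple;
    }
    /* Otherwise the ALUs are busy enough, or there is no room to grow. */
    return current;
}
```

Each returned size still needs to be profiled again, since the ratio changes with the work group size.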
