Optimal Local/Global worksizes in OpenCL


Question

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL. Is there any universal rule for AMD, NVIDIA, and Intel GPUs? Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?

Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) find suitable values by simply testing many combinations of local/global work sizes and choosing the best one.

Recommended answer

NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together as a unit). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I am unsure about Intel, but you can find this type of information in their documentation.
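To make the effect of these multiples concrete, here is a minimal sketch in plain C of a simplified model: it assumes the hardware allocates lanes in whole warps/wavefronts, so a work-group whose size is not a multiple of the unit width leaves some lanes without a work-item. The function names `units_per_group` and `idle_lanes` are hypothetical helpers for illustration, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Number of scheduling units (warps/wavefronts) needed to run one work-group
 * of `local_size` work-items when each unit is `width` lanes wide.
 * Assumption: lanes are only ever allocated in whole units. */
size_t units_per_group(size_t local_size, size_t width)
{
    assert(width > 0);
    return (local_size + width - 1) / width;   /* ceiling division */
}

/* Lanes that are allocated but have no work-item assigned to them. */
size_t idle_lanes(size_t local_size, size_t width)
{
    return units_per_group(local_size, width) * width - local_size;
}
```

Under this model, a local size of 48 on a 32-wide warp occupies two warps and leaves 16 lanes idle, while a local size of 64 leaves none. In a real host program the preferred multiple can usually be queried at run time with `clGetKernelWorkGroupInfo` and `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` (OpenCL 1.1 and later) instead of being hard-coded per vendor.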

So suppose you are doing some computation with 2300 work-items (the global size). 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL will choose one for you, and it may be a poor choice. When the local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to poor device utilization. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by both 32 and 64. In the actual kernel, you can write something like:

    int globalID = get_global_id(0);
    // Padding work-items (IDs at or beyond the real count) repeat the work of item 0.
    if (globalID >= realNumberOfThreads)
        globalID = 0;

This makes the four extra threads do the same work as thread 0. (It is often faster to do some extra work than to have many idle threads.)
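The padding arithmetic above can be sketched on the host side as follows. This is a minimal illustration in plain C, not code from the answer: `round_up` and `effective_id` are hypothetical helper names, and `effective_id` only models on the CPU what the kernel guard does to each work-item's ID.

```c
#include <assert.h>
#include <stddef.h>

/* Round `n` up to the next multiple of `multiple` (e.g. the chosen local size). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    size_t remainder = n % multiple;
    return remainder == 0 ? n : n + (multiple - remainder);
}

/* Host-side model of the kernel guard: IDs belonging to padding work-items
 * are redirected to item 0, all other IDs are left unchanged. */
size_t effective_id(size_t global_id, size_t real_count)
{
    return global_id >= real_count ? 0 : global_id;
}
```

For 2300 real work-items and a local size of 32 or 64, `round_up` gives 2304. That padded value would be passed as the global work size to `clEnqueueNDRangeKernel`, with the real count (2300) supplied to the kernel as an argument such as `realNumberOfThreads`.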

Hope that answered your question. GL HF!
