Number of Compute Units corresponding to the number of work groups


Question

I need some clarification. I'm developing OpenCL on my laptop, which runs a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups used to run a kernel should correspond to the number of compute units (Heterogeneous Computing with OpenCL, Chapter 9, p. 186), otherwise it would waste too much global memory bandwidth.
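
For reference, this is what that query looks like on the host side. A minimal sketch, with error checking omitted; picking the first platform and the first GPU device is an assumption for brevity:

    /* Minimal sketch: query CL_DEVICE_MAX_COMPUTE_UNITS.
       Error checking is omitted; using the first platform and the
       first GPU device is an assumption for brevity. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint compute_units;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);

        printf("CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", compute_units);
        return 0;
    }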

Also, the chip is specified to have 16 CUDA cores (which, I believe, correspond to PEs). Does that mean that, theoretically, the most performant setup for this GPU, regarding global memory bandwidth, is to have two work groups with 16 work items each?
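
In other words, a launch configuration like the following sketch, where `queue` and `kernel` are assumed to have been created already:

    /* Sketch of the setup being asked about:
       2 work groups x 16 work items = 32 work items in total.
       `queue` and `kernel` are assumed to exist. */
    size_t global_size = 32;  /* total number of work items */
    size_t local_size  = 16;  /* work items per work group */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);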

Answer

While setting the number of work groups to be equal to CL_DEVICE_MAX_COMPUTE_UNITS might be sound advice on some hardware, it certainly is not on NVIDIA GPUs.

On the CUDA architecture, an OpenCL compute unit is the equivalent of a multiprocessor (which can have 8, 32, or 48 cores), and each of these is designed to run up to 8 work groups (blocks, in CUDA terms) simultaneously. At larger input data sizes, you might choose to run thousands of work groups, and your particular GPU can handle up to 65535 x 65535 work groups per kernel launch.
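
To illustrate, the following host-side sketch sizes a one-dimensional launch from the problem size rather than from the number of compute units; `queue`, `kernel`, and the problem size `n` are assumptions:

    /* Sketch: derive the launch size from the data, not from the
       number of compute units. `queue`, `kernel`, and `n` are assumed. */
    size_t n = 1 << 24;        /* hypothetical problem size: 16M elements */
    size_t local_size  = 128;  /* a multiple of the 32-wide warp */
    size_t global_size = ((n + local_size - 1) / local_size) * local_size;
    /* global_size / local_size = 131072 work groups here, i.e. thousands
       of groups rather than one per compute unit. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);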

OpenCL has another useful query, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (a per-kernel property, retrieved with clGetKernelWorkGroupInfo). On an NVIDIA device it returns 32; this is the "warp", the natural SIMD width of the hardware. That value is the multiple your work group size should be; work group sizes can be up to 512 items each, depending on the resources consumed by each work item. The standard rule of thumb for your particular GPU is that you need at least 192 active work items per compute unit (threads per multiprocessor, in CUDA terms) to cover all of the architecture's latency and potentially obtain either full memory bandwidth or full arithmetic throughput, depending on the nature of your code.
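
A minimal sketch of that query, with `kernel` and `device` assumed to exist and error checking omitted:

    /* Sketch: query the preferred work group size multiple for a
       built kernel. `kernel` and `device` are assumed to exist. */
    size_t preferred_multiple;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple),
                             &preferred_multiple, NULL);
    /* On the NVIDIA GPU discussed here this returns 32, the warp width. */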

NVIDIA ships a good document called "OpenCL Programming Guide for the CUDA Architecture" with the CUDA toolkit. You should take some time to read it, because it contains all the specifics of how the NVIDIA OpenCL implementation maps onto the features of their hardware, and it will answer the questions you have raised here.
