Determine limiting factor of OpenCL workgroup size?


Question

I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with fewer resources. In particular, the desktop version assumes that a work-group size of at least 256 is always supported, but the ARM-based Mali T628 GPU only guarantees a work-group size of 64 or more.

Indeed, some kernels report a CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is under 2 KiB, whereas CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.

What other factors (e.g., registers or __private memory) contribute to a low CL_KERNEL_WORK_GROUP_SIZE, and how do I check their usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo(), which I have already used to some extent) and any development tools I may not know about.

The kernels are part of the OpenCL module of OpenCV 2.4, in particular the kernel icvCalcOrientation in surf.cl. The code is fairly complex and several compile-time parameters are set, so it is not really feasible to analyze the kernel manually without some hint of what to look for.

If there is a way to troubleshoot this on NVIDIA or AMD hardware (which I have access to), I am open to it.

Recommended answer

EDIT

Since my previous answer was plainly wrong, I need more information on the problem.

By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you are implying that there are kernels for which a larger work-group size is available. Is that the case? If not, then the answer unfortunately is that the device is simply not capable of supporting more than 64 work-items per work-group.

Could you please query all available information for the device and the kernel after setting all kernel arguments and before executing the kernel? The parameters to query (mostly taken from (Source)) are:

  • CL_DEVICE_GLOBAL_MEM_SIZE
  • CL_DEVICE_LOCAL_MEM_SIZE
  • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
  • CL_DEVICE_MAX_MEM_ALLOC_SIZE
  • CL_DEVICE_MAX_WORK_GROUP_SIZE
  • CL_DEVICE_MAX_WORK_ITEM_SIZES
  • CL_KERNEL_WORK_GROUP_SIZE
  • CL_KERNEL_LOCAL_MEM_SIZE
  • CL_KERNEL_PRIVATE_MEM_SIZE

There might be more, but currently none come to mind.

General info:

A work-group size can be limited because local memory is limited. This limit can be reached if you have a kernel that uses lots of private memory ("lots" is a relative term: on weaker hardware it may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).

So some of this private memory may be spilled to local memory without you realizing it, in which case the local memory the kernel uses plus the local memory needed for the spilled private data can exceed the available local memory.

CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory; CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory the kernel uses. Apparently this also takes dynamic local memory into account by looking at clSetKernelArg. However, I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...).

Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need to be spilled to local memory) it can support. This reduced work-group size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
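To make that reasoning concrete, here is a minimal sketch in plain C of the calculation described above. It is only a model of this hypothesis, not what any driver is known to do; the function name and the spilled_bytes_per_wi parameter are made up for illustration, and the other inputs are assumed to have been queried beforehand with clGetDeviceInfo and clGetKernelWorkGroupInfo.

```c
#include <stddef.h>

/* Hypothetical model: how many work-items fit if each one spills some
 * private data into local memory. All names here are illustrative.
 *   device_local_bytes   : value of CL_DEVICE_LOCAL_MEM_SIZE
 *   kernel_local_bytes   : value of CL_KERNEL_LOCAL_MEM_SIZE
 *   spilled_bytes_per_wi : ASSUMED private bytes per work-item that spill
 *   device_max_wg        : value of CL_DEVICE_MAX_WORK_GROUP_SIZE
 */
size_t estimate_max_work_group_size(size_t device_local_bytes,
                                    size_t kernel_local_bytes,
                                    size_t spilled_bytes_per_wi,
                                    size_t device_max_wg)
{
    if (kernel_local_bytes > device_local_bytes)
        return 0;                 /* the kernel's own local data does not fit */
    if (spilled_bytes_per_wi == 0)
        return device_max_wg;     /* nothing spills: only the device limit applies */

    /* Local memory left over after the kernel's own __local data. */
    size_t free_bytes = device_local_bytes - kernel_local_bytes;
    size_t fit = free_bytes / spilled_bytes_per_wi;

    return fit < device_max_wg ? fit : device_max_wg;
}
```

With the numbers from the question (32 KiB of device local memory, 2 KiB used by the kernel), this model only yields 64 work-items if each work-item spills about 480 bytes, since 30720 / 480 = 64. Anything up to 120 bytes per work-item would still allow 256.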

After looking at the kernel you posted, I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 KiB of the 32 KiB of local memory.
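The rule-of-thumb used here to rule out local memory can be written down as a small check. This is a rough sketch in plain C: the enum, the function name, and the "more than half of local memory in use" threshold are arbitrary illustrative choices, not anything defined by OpenCL, and the inputs are assumed to come from clGetKernelWorkGroupInfo and clGetDeviceInfo.

```c
#include <stddef.h>

/* Illustrative classification only; these names are not part of OpenCL. */
typedef enum {
    LIMIT_NONE,           /* kernel already reaches the device maximum */
    LIMIT_LOCAL_MEMORY,   /* local memory use is a plausible culprit */
    LIMIT_OTHER_RESOURCE  /* look elsewhere, e.g. registers / __private data */
} limit_hint;

limit_hint guess_limiting_factor(size_t kernel_wg,            /* CL_KERNEL_WORK_GROUP_SIZE */
                                 size_t device_max_wg,        /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
                                 unsigned long long kernel_local_bytes,  /* CL_KERNEL_LOCAL_MEM_SIZE */
                                 unsigned long long device_local_bytes)  /* CL_DEVICE_LOCAL_MEM_SIZE */
{
    if (kernel_wg >= device_max_wg)
        return LIMIT_NONE;

    /* ARBITRARY threshold: treat local memory as suspicious only when the
     * kernel uses more than half of what the device offers. */
    if (kernel_local_bytes * 2 > device_local_bytes)
        return LIMIT_LOCAL_MEMORY;

    return LIMIT_OTHER_RESOURCE;
}
```

For a kernel reporting 64 with 2 KiB of 32 KiB local memory in use, and assuming a device maximum of 256, this returns LIMIT_OTHER_RESOURCE, which matches the conclusion that something other than __local storage is limiting the work-group size.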
