Why is there a CL_DEVICE_MAX_WORK_GROUP_SIZE?

Question

I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.

It seems to me that this should be taken care of by the compiler: if a kernel (one-dimensional for simplicity) is executed with a local work group size of 500 while its physical maximum is 100, and the kernel looks, for example, like this:

__kernel void test(__global float* input) {
    size_t i = get_global_id(0);
    someCode(i);                    /* someCode etc. stand for arbitrary per-item work */
    barrier(CLK_GLOBAL_MEM_FENCE);  /* barrier() requires a fence-flags argument */
    moreCode(i);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(i);
}

then it could be converted automatically into an execution with work group size 100 using this kernel:

__kernel void test(__global float* input) {
    size_t i = get_global_id(0);
    /* each physical work item now covers 5 logical ones */
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
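
On the host side, only the launch configuration would change. A minimal sketch of what the coarsened launch might look like (the handles, the helper name, and the total problem size of 500 are illustrative, not part of the question):

#include <CL/cl.h>

/* Hypothetical coarsened launch: 500 logical work items are covered
   by 100 physical ones, each processing 5 consecutive items. */
void launch_coarsened(cl_command_queue queue, cl_kernel kernel) {
    size_t global_size = 100;  /* was 500 before the transformation */
    size_t local_size  = 100;  /* the physical per-group maximum */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}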

However, it seems that this is not done by default. Why not? Is there a way to automate this process (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem that can make my method fail on certain examples (and can you give me one)?

Answer

I think the origin of CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.

Multiple threads run simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family there is a hardware limit on the number of available stack entries (every stack entry has sub-entries). This in essence limits the number of threads every compute unit can handle simultaneously.
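
Whatever the hardware specifics, the limit is queryable at runtime, and an individual kernel can have an even smaller effective limit once its resource usage is known. A minimal host-side sketch (error handling omitted; device and kernel are assumed to be valid handles obtained elsewhere):

#include <stdio.h>
#include <CL/cl.h>

/* Query the device-wide limit and the per-kernel limit, which
   can be smaller due to the kernel's register/resource usage. */
void print_group_limits(cl_device_id device, cl_kernel kernel) {
    size_t device_max = 0, kernel_max = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);
    printf("device max: %zu, kernel max: %zu\n", device_max, kernel_max);
}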

As for whether the compiler could do this to make it possible: it could work, but understand that it would mean recompiling the kernel, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software, just for "not so open-source" reasons.
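
For illustration, a sketch of that dump-and-reload workflow using the standard program-binary APIs (single device assumed, error handling omitted; the helper names are mine):

#include <stdlib.h>
#include <CL/cl.h>

/* Extract the compiled binary of an already-built program
   (single device assumed; the caller frees the returned buffer). */
unsigned char *dump_binary(cl_program program, size_t *size) {
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(*size), size, NULL);
    unsigned char *binary = malloc(*size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(binary), &binary, NULL);
    return binary;
}

/* Recreate a program from a shipped binary instead of source. */
cl_program load_binary(cl_context context, cl_device_id device,
                       const unsigned char *binary, size_t size) {
    cl_int status, err;
    cl_program program = clCreateProgramWithBinary(
        context, 1, &device, &size, &binary, &status, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    return program;
}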
