Questions about global and local work size


Problem description

Searching through the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?

While digging into OpenCL and reading tutorials, some things remained unclear to me. Here is a collection of my questions regarding local and global work sizes.

  1. Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES? On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.

  2. Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the kernel being used?

    1. Or is this the only work_group_size the GPU allows? On my machine CL_KERNEL_WORK_GROUP_SIZE = 512.

  3. Do I need to divide the work into work groups, or can I have just one by not specifying local_work_size?

    1. What do I have to pay attention to when I have only one work group?

  4. What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean? On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64.

    1. Does this mean I can have one work group that is as large as CL_DEVICE_MAX_WORK_ITEM_SIZES?

  5. Does global_work_size have to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES? In my code global_work_size = 20.

Recommended answer

In general you can choose global_work_size as big as you want, while local_work_size is constrained by the underlying device/hardware, so all query results tell you the possible dimensions for local_work_size rather than for global_work_size. The only constraint on global_work_size is that it must be a multiple of local_work_size (in each dimension).
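
As a minimal sketch of the multiple-of-local-size rule: a common way to satisfy it is to round the number of work items up to the next multiple of the chosen local size. The helper name `round_up_global_size` is hypothetical and not part of the OpenCL API; the result would be passed as the global size to `clEnqueueNDRangeKernel`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): round `work_items` up to
 * the next multiple of `local_size`, so the result can be used as a
 * global work size together with that local work size. */
size_t round_up_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = work_items % local_size;
    if (remainder == 0)
        return work_items;          /* already a multiple */
    return work_items + (local_size - remainder);
}
```

With the 20 work items from the question and a local size of 16, this yields a global size of 32. The padding work items then need a bounds check at the top of the kernel (for example, comparing `get_global_id(0)` against the real item count passed in as a kernel argument) so that they do nothing.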

The work group sizes specify the sizes of the workgroups, so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64, your local_work_size can't be bigger than 512 in the x and y dimensions and 64 in the z dimension.

However, there is also a constraint on the local group size that depends on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your total workgroup size (the product of all dimensions, e.g. 256 for a local size of 16, 16, 1) must not be greater than that number. This is because limited hardware resources have to be divided between the threads (from your query results I assume you are programming on an NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads that can be executed in parallel).
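
The two constraints described so far (a per-dimension limit and a limit on the product of all dimensions) can be sketched as a single check. This is an illustrative helper under stated assumptions, not an OpenCL function: `max_item_sizes` stands for the values reported for CL_DEVICE_MAX_WORK_ITEM_SIZES and `max_group_size` for the value reported for CL_KERNEL_WORK_GROUP_SIZE, which a real program would obtain via `clGetDeviceInfo` and `clGetKernelWorkGroupInfo`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the local work size `local`
 * (with `dims` dimensions, 1 to 3) respects both the per-dimension
 * limits and the limit on the total number of work items per group. */
bool local_size_is_valid(const size_t *local, unsigned dims,
                         const size_t *max_item_sizes,
                         size_t max_group_size)
{
    assert(dims >= 1 && dims <= 3);
    size_t total = 1;
    for (unsigned d = 0; d < dims; ++d) {
        /* Per-dimension limit. */
        if (local[d] == 0 || local[d] > max_item_sizes[d])
            return false;
        /* Product limit, checked by division to avoid overflow. */
        if (local[d] > max_group_size / total)
            return false;
        total *= local[d];
    }
    return true;
}
```

With the limits from the question (512, 512, 64 per dimension and 512 in total), a local size of 16, 16, 1 passes, while 512, 512, 64 fails: each dimension is within its own limit, but the product is far larger than 512.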

CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but it is specific to the device instead of the kernel (and it should be a single scalar value, e.g. 512, rather than one value per dimension).

You can choose not to specify the local work size, in which case the OpenCL implementation will choose a workgroup size for you (so there is no guarantee that it uses only one workgroup). However, this is generally not advisable, since you don't know how your work is divided into workgroups, and it is not guaranteed that the chosen workgroup size will be optimal.
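
One way to pick the local size yourself instead of leaving it to the implementation is sketched below for the one-dimensional case. The heuristic is an assumption made for illustration, not something the answer or the OpenCL specification prescribes: prefer the largest multiple of the hardware granularity that divides the global size and fits under the limit, and otherwise fall back to the largest divisor that fits. The name `pick_local_size` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic for a 1D range: choose a local work size that
 * divides `global_size` and does not exceed `max_group_size`, preferring
 * multiples of `granularity` (e.g. 32 or 64 on GPUs). */
size_t pick_local_size(size_t global_size, size_t max_group_size,
                       size_t granularity)
{
    assert(global_size > 0 && max_group_size > 0 && granularity > 0);

    /* First pass: largest multiple of `granularity` that fits under the
     * limit and divides the global size. */
    for (size_t c = (max_group_size / granularity) * granularity;
         c >= granularity; c -= granularity) {
        if (global_size % c == 0)
            return c;
    }

    /* Fallback: largest divisor of the global size that fits. */
    for (size_t c = max_group_size; c > 1; --c) {
        if (global_size % c == 0)
            return c;
    }
    return 1;
}
```

For a global size of 1024 with a limit of 512 and a granularity of 32 this returns 512. For the global size of 20 from the question no multiple of 32 divides it, so the fallback returns 20; rounding the global size up to a multiple of 32 or 64 first would usually be the better choice on a GPU.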

However, note that using only one workgroup is generally a bad idea performance-wise (and why use OpenCL if performance is not a concern?). In general a workgroup has to execute on a single compute unit, while most devices have more than one (modern CPUs have two or more, one per core, while modern GPUs can have 20 or more). Furthermore, even the one compute unit on which your workgroup executes might not be fully used, since several workgroups can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads (depending on the generation: G80/GT200/GF100) executing on one compute unit, and while I don't know the numbers for AMD right now, they are of the same order of magnitude, so it is good to have more than one workgroup.

Furthermore, for GPUs it is typically advisable to have workgroups with at least 64 threads, and a thread count per workgroup that is divisible by 32/64 (NVIDIA/AMD), because otherwise you will again lose performance: 32/64 is the minimum granularity for execution on GPUs, so a workgroup with fewer items still executes as 32/64 threads, with the results from the unused threads discarded.
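
The arithmetic behind these recommendations can be sketched with three small helpers. The function names are hypothetical, and the model is deliberately simplified: it only counts work-groups for a 1D range and the hardware lanes a work-group occupies when execution happens in batches of `granularity` work items (32 or 64 in the figures above).

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups launched for a 1D range. Assumes the global
 * size is a multiple of the local size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hardware lanes occupied by one work-group when the device executes
 * work items in batches of `granularity`. */
size_t padded_lanes(size_t local_size, size_t granularity)
{
    assert(granularity > 0);
    return ((local_size + granularity - 1) / granularity) * granularity;
}

/* Lanes that run but whose results are discarded. */
size_t idle_lanes(size_t local_size, size_t granularity)
{
    return padded_lanes(local_size, granularity) - local_size;
}
```

A single work-group of 20 items with a granularity of 32 still occupies 32 lanes, 12 of them idle, and keeps only one compute unit busy. A global size of 1024 with a local size of 64 gives 16 work-groups with no idle lanes, which the device can spread over several compute units.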
