OpenCL ND-Range boundaries?

Question

Consider a kernel which performs vector addition:

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{                                           
    //Get our global thread ID              
    int id = get_global_id(0);              

    //Make sure we do not go out of bounds  
    if (id < n)                             
        c[id] = a[id] + b[id];              
}

Is it really necessary to pass the size n to the kernel and check the boundaries?

I have seen the same version without the check on n. Which one is correct?

More generally, I wonder what happens if the size of the data to process differs from the user-defined ND-Range.

Will the remaining, out-of-bounds, data be processed or not?

  • If yes, how is it processed?
  • If not, does that mean the user must take the boundaries into account when writing the kernel?

Does OpenCL specify any of that?

Thanks

Recommended answer

The check against n is a good idea if you can't be certain that the kernel is launched with exactly n work items, for example when the global size is rounded up to a multiple of the work-group size. When you know you will only ever call the kernel with exactly n work items, the check only takes up processing cycles, kernel size, and the instruction scheduler's attention.
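To make the rounding case concrete, here is a minimal host-side sketch in plain C. The helper names round_up_global_size and padding_work_items are hypothetical, not part of the OpenCL API. It assumes the common pattern of padding the global size to a multiple of the work-group size, which OpenCL 1.x requires whenever an explicit local size is passed to clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of local_size. */
static size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Number of launched work items whose global id falls outside [0, n).
   These are the work items that the `if (id < n)` check turns into no-ops. */
static size_t padding_work_items(size_t n, size_t local_size)
{
    return round_up_global_size(n, local_size) - n;
}
```

With n = 1000 and a work-group size of 64, the padded global size is 1024, so 24 work items have an id at or beyond n. Without the check, those would read and write past the end of the buffers.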

Nothing happens to any extra data you pass to the kernel: elements that no work item touches are simply left as they are. If you never use that data, though, you wasted time copying it to the device.

I like to make a kernel's work-group size and global size independent of the total amount of work to be done. When I do that, I need to pass in n.

For example:

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID and the total number of work items
    size_t gid = get_global_id(0);
    size_t gsize = get_global_size(0);

    // The for-loop condition doubles as the bounds check against n;
    // size_t avoids a signed/unsigned comparison with n
    for (size_t i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}

The example works for an arbitrary value of n and for any global size. Each work item processes every gsize-th element, beginning at its own global id, so a launch with fewer work items than n still covers all the data, and a launch with more simply leaves the extra work items idle. The same idea works well with work groups too, sometimes outperforming the global version I have listed because of better memory locality.
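As a sanity check on the strided loop, the following plain C sketch simulates it on the host. It is not OpenCL code, and the function name strided_loop_covers_once is hypothetical. It runs the same loop once per simulated global id and reports whether every index in [0, n) was visited exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Host-side simulation of the kernel's strided loop.
   Returns 1 if every index in [0, n) is visited exactly once across
   gsize simulated work items, 0 otherwise (or on allocation failure). */
static int strided_loop_covers_once(size_t n, size_t gsize)
{
    assert(gsize > 0);
    unsigned char *hits = (unsigned char *)calloc(n ? n : 1, 1);
    if (!hits)
        return 0;

    /* One outer iteration per simulated work item (global id). */
    for (size_t gid = 0; gid < gsize; ++gid)
        for (size_t i = gid; i < n; i += gsize)
            hits[i]++;

    int ok = 1;
    for (size_t i = 0; i < n; ++i)
        if (hits[i] != 1)
            ok = 0;

    free(hits);
    return ok;
}
```

Calling it with n = 1000 and gsize = 64 (fewer work items than elements) or with n = 10 and gsize = 256 (more work items than elements) returns 1 in both cases, which matches the claim that the loop is correct for any combination of n and global size.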

If you know the value of n to be constant, it is often better to hard-code it (as a #define at the top of the kernel source). This lets the compiler optimize for that specific value and eliminates the extra parameter. Examples of such kernels include DFT/FFT processing, bitonic sorting at a given stage, and image processing with constant dimensions.
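One way to hard-code n without editing the kernel source is to define it at build time. The sketch below is plain C, and both the function name format_n_define and the macro name N are assumptions made for illustration. It formats a "-D N=<value>" string, which is the preprocessor-define form accepted in the options argument of clBuildProgram.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "-D N=<n>" into buf.
   Returns 0 on success, -1 if buf is too small or formatting fails. */
static int format_n_define(char *buf, size_t buf_size, unsigned int n)
{
    assert(buf != NULL);
    int written = snprintf(buf, buf_size, "-D N=%u", n);
    return (written >= 0 && (size_t)written < buf_size) ? 0 : -1;
}
```

The resulting string would be passed as the build options when compiling the program, and the kernel would then use N wherever it previously used the parameter n. The trade-off is that the program has to be rebuilt whenever the value changes.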
