Unroll loops in an AMD OpenCL kernel

Question

I'm trying to assess OpenCL performance on AMD hardware. I have a kernel for the Hough transform that contains two #pragma unroll statements, but when I run the kernel they do not produce any speedup.

kernel void hough_circle(read_only image2d_t imageIn, global int* in, const int w_hough, __global int* circle)
{
    sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int gid0 = get_global_id(0);
    int gid1 = get_global_id(1);

    uint4 pixel;
    int x0 = 0, y0 = 0, r;
    int maxval = 0;
    pixel = read_imageui(imageIn, sampler, (int2)(gid0, gid1));
    if (pixel.x == 255)
    {
        #pragma unroll 20
        for (int r = 90; r < 110; r += 1)
        {
            // int r = 190;

            #pragma unroll 360
            for (int theta = 0; theta < 360; theta++)
            {
                x0 = (int) round(gid0 - r * cos((float) radians((float) theta)));
                y0 = (int) round(gid1 - r * sin((float) radians((float) theta)));

                // if ((x0 > 0) && (x0 < get_global_size(0)) && (y0 > 0) && (y0 < get_global_size(1)))
                //     in[w_hough * y0 + x0]++;
            }
        }
    }
}

Does #pragma unroll work with AMD OpenCL? Any help is appreciated.

Recommended answer

Unrolling does work with AMD.

http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

That tool includes KernelAnalyzer, which lets you see the actual output of their compiler. I used it to verify that unrolling actually does produce a different kernel.

However, unrolling loops does not necessarily give you any speedup. After all, it only saves the jump instructions at the expense of program size, whereas on a GPU you are usually bound by memory latency.
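To make concrete what unrolling trades, here is a minimal sketch of a rolled loop next to a hand-unrolled equivalent. The function names and the factor of 4 are hypothetical, chosen only for illustration, and the loop bounds in the usage note mirror the r loop from the question. The code stays within the C subset that OpenCL C also accepts.

```c
/* Rolled: one compare-and-branch per radius. */
int radius_sum_rolled(int r_begin, int r_end)
{
    int acc = 0;
    for (int r = r_begin; r < r_end; r++)
        acc += r;
    return acc;
}

/* Unrolled by 4 by hand: one compare-and-branch per four radii,
   paid for with a body four times as long.
   Assumes (r_end - r_begin) is a multiple of 4 to keep the sketch short. */
int radius_sum_unrolled4(int r_begin, int r_end)
{
    int acc = 0;
    for (int r = r_begin; r < r_end; r += 4) {
        acc += r;
        acc += r + 1;
        acc += r + 2;
        acc += r + 3;
    }
    return acc;
}
```

Called with the bounds from the question (90 and 110), both functions do the same arithmetic and return the same sum. Only the number of loop-control steps differs, which is why the gain is small when the loop body itself is the expensive part.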

In your case the bottleneck is probably the sin/cos functions, which are extremely slow on AMD hardware (and on other GPUs as well). You should use native_sin and native_cos. They are not as precise and do not support as wide an input range as the normal ones, which is why they are not used by default, but in most cases they are enough. Incidentally, the precision of the native_ functions is the same as that required of sin and cos by DirectX shaders.
