Using cl_float3 in an OpenCL parallel reduction example


Question

I adapted the OpenCL parallel reduction example to work on an array of floats. Now I want to extend the code to handle cl_float3, that is, to find the minimum over an array of cl_float3. I thought this would be a straightforward extension from float to float3 in the kernel, but I receive garbage values when I return from the kernel. The kernels are below:

__kernel void pmin3(__global float3 *src,
                    __global float3 *gmin,
                    __local  float3 *lmin,
                    __global float  *dbg,
                    uint             nitems,
                    uint             dev)
{
    uint count  = nitems / get_global_size(0);
    uint idx    = (dev == 0) ? get_global_id(0) * count
                             : get_global_id(0);
    uint stride = (dev == 0) ? 1 : get_global_size(0);

    // Private min for the work-item
    float3 pmin = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);

    for (int n = 0; n < count; n++, idx += stride) {
        pmin.x = min(pmin.x,src[idx].x);
        pmin.y = min(pmin.y,src[idx].y);
        pmin.z = min(pmin.z,src[idx].z);
    }

    // Reduce values within the work-group into local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        lmin[0] = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);
    for (int n = 0; n < get_local_size(0); n++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == n) {
            lmin[0].x = min(lmin[0].x,pmin.x);
            lmin[0].y = min(lmin[0].y,pmin.y);
            lmin[0].z = min(lmin[0].z,pmin.z);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write to __global gmin which will contain the work-group minima
    if (get_local_id(0) == 0)
        gmin[get_group_id(0)] = lmin[0];

    // Collect debug information
    if (get_global_id(0) == 0) {
        dbg[0] = get_num_groups(0);
        dbg[1] = get_global_size(0);
        dbg[2] = count;
        dbg[3] = stride;
    }
}

__kernel void min_reduce3(__global float3 *gmin)
{
    for (int n = 0; n < get_global_size(0); n++) {
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_global_id(0) == n) {
            gmin[0].x = min(gmin[0].x,gmin[n].x);
            gmin[0].y = min(gmin[0].y,gmin[n].y);
            gmin[0].z = min(gmin[0].z,gmin[n].z);
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}

I suspect the problem is with get_global_id(0) and get_global_size(0), which report the entire size rather than only the number of rows. Any suggestions?

Recommended answer

As others mentioned, float3 (and the other type3 vector types) behaves like float4 (and the other type4 types) for the purposes of size and alignment. You can also see this with the built-in vec_step function, which returns the number of elements in its argument's type but returns 4 for type3 objects.
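To make the size and alignment point concrete, here is a minimal host-side sketch in plain C. It does not use the OpenCL headers; `device_float3`, `device_index`, and `packed_index` are hypothetical names introduced only for illustration. It assumes the device view of a float3 is four floats wide with 16-byte alignment, as described above, and shows which float a device-side read would touch if the buffer were actually tightly packed.

```c
#include <stddef.h>

/* Hypothetical stand-in for the device view of a float3:
   four floats wide and 16-byte aligned; the fourth lane is padding. */
typedef struct {
    _Alignas(16) float s[4];
} device_float3;

/* Float index of component c of element i when the buffer is
   addressed as float3 on the device (stride of 4 floats). */
static size_t device_index(size_t i, size_t c)
{
    return 4 * i + c;
}

/* Float index of the same component in a tightly packed host
   array (stride of 3 floats). */
static size_t packed_index(size_t i, size_t c)
{
    return 3 * i + c;
}
```

With a packed buffer laid out as x0 y0 z0 x1 y1 z1 ..., the x component of element 1 sits at float index 3, but a float3-typed read of element 1 starts at float index 4, which holds y1. Every element after the first is shifted this way, and reads near the end run past the buffer, which would be consistent with the garbage values described in the question.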

If your host code generates a packed float3 array, with each element taking the size and alignment of just 3 floats, then the proper way to use it from OpenCL is:

  • Use a float* parameter instead of float3*
  • Load the data using vload3
  • Store the data using vstore3
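The three steps above can be sketched on the host in plain C. This is an emulation under stated assumptions, not kernel code: `vec3`, `load3`, `store3`, and `min3_packed` are hypothetical helpers that only mirror the addressing of the OpenCL built-ins, where `vload3(offset, p)` reads the three floats starting at `p + 3 * offset` and `vstore3(data, offset, p)` writes them back to the same place. The comment inside the block shows roughly what the corresponding kernel lines might look like.

```c
#include <stddef.h>
#include <float.h>

/* Hypothetical host-side helper type holding three components. */
typedef struct { float x, y, z; } vec3;

/* Mirrors the addressing of vload3(offset, p): reads p[3*offset + 0..2]. */
static vec3 load3(size_t offset, const float *p)
{
    vec3 v = { p[3 * offset], p[3 * offset + 1], p[3 * offset + 2] };
    return v;
}

/* Mirrors the addressing of vstore3(v, offset, p): writes p[3*offset + 0..2]. */
static void store3(vec3 v, size_t offset, float *p)
{
    p[3 * offset]     = v.x;
    p[3 * offset + 1] = v.y;
    p[3 * offset + 2] = v.z;
}

/* Component-wise minimum over n tightly packed 3-float elements.
   In a kernel, with src and gmin declared as __global float *,
   the equivalent lines would be along the lines of:
       float3 v = vload3(idx, src);
       pmin = min(pmin, v);
       vstore3(lmin[0], get_group_id(0), gmin);                 */
static vec3 min3_packed(const float *src, size_t n)
{
    vec3 m = { FLT_MAX, FLT_MAX, FLT_MAX };
    for (size_t i = 0; i < n; i++) {
        vec3 v = load3(i, src);
        if (v.x < m.x) m.x = v.x;
        if (v.y < m.y) m.y = v.y;
        if (v.z < m.z) m.z = v.z;
    }
    return m;
}
```

For example, calling `min3_packed` on the nine floats 3 9 5 1 8 7 4 2 6 (three packed elements) yields the component-wise minimum (1, 2, 5), and `store3` can then write that result into slot 0 of a packed output array. Note that the sketch starts from FLT_MAX rather than the 2^32 - 1 sentinel used in the question; that is a choice made here for the illustration.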
