Using cl_float3 in the OpenCL parallel reduction example
Question
I adapted the OpenCL parallel reduction example to work on an array of floats. Now I want to extend the code to cl_float3, that is, to find the component-wise minimum of an array of cl_float3. I thought it would be a straightforward extension from float to float3 in the kernel, but I get garbage values back from the kernel. The kernels are below:
__kernel void pmin3(__global float3 *src,
                    __global float3 *gmin,
                    __local float3 *lmin,
                    __global float *dbg,
                    uint nitems,
                    uint dev)
{
    uint count = nitems / get_global_size(0);
    uint idx = (dev == 0) ? get_global_id(0) * count
                          : get_global_id(0);
    uint stride = (dev == 0) ? 1 : get_global_size(0);

    // Private min for the work-item
    float3 pmin = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);
    for (int n = 0; n < count; n++, idx += stride) {
        pmin.x = min(pmin.x,src[idx].x);
        pmin.y = min(pmin.y,src[idx].y);
        pmin.z = min(pmin.z,src[idx].z);
    }

    // Reduce values within the work-group into local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        lmin[0] = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);
    for (int n = 0; n < get_local_size(0); n++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == n) {
            lmin[0].x = min(lmin[0].x,pmin.x);
            lmin[0].y = min(lmin[0].y,pmin.y);
            lmin[0].z = min(lmin[0].z,pmin.z);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write to __global gmin which will contain the work-group minima
    if (get_local_id(0) == 0)
        gmin[get_group_id(0)] = lmin[0];

    // Collect debug information
    if (get_global_id(0) == 0) {
        dbg[0] = get_num_groups(0);
        dbg[1] = get_global_size(0);
        dbg[2] = count;
        dbg[3] = stride;
    }
}

__kernel void min_reduce3( __global float3 *gmin)
{
    for (int n = 0; n < get_global_size(0); n++) {
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_global_id(0) == n) {
            gmin[0].x = min(gmin[0].x,gmin[n].x);
            gmin[0].y = min(gmin[0].y,gmin[n].y);
            gmin[0].z = min(gmin[0].z,gmin[n].z);
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
I suspect the problem is with get_global_id(0) and get_global_size(), which give the entire size rather than just the number of rows. Any suggestions?
Recommended answer
As others have mentioned, float3 (and the other type3 types) behaves like float4 (and the other type4 types) for the purposes of size and alignment. This can also be seen with the built-in vec_step function, which returns the number of elements in its argument's type but returns 4 for type3 objects.
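To make the size and alignment point concrete, here is a minimal sketch in plain host-side C (not OpenCL C). It assumes the usual 4-byte float, and the names `padded_float3`, `packed_offset`, `padded_offset` and `misread_x` are hypothetical helpers invented for illustration. It shows where a `float3*` style access (4 floats per element) lands when the buffer is actually packed at 3 floats per element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 16-byte footprint that an OpenCL float3
   occupies (same size as float4); the fourth slot is padding. */
typedef struct { float x, y, z, w; } padded_float3;

/* Float index where element i starts in a tightly packed array
   (3 floats per element). */
size_t packed_offset(size_t i) { return 3 * i; }

/* Float index where a float3-style access expects element i to start
   (4 floats per element, assuming a 4-byte float). */
size_t padded_offset(size_t i) {
    return (sizeof(padded_float3) / sizeof(float)) * i;
}

/* Read the "x" of element i using the padded stride on packed data.
   For i > 0 this returns a float belonging to a different component. */
float misread_x(const float *packed, size_t n_floats, size_t i) {
    size_t off = padded_offset(i);
    assert(off < n_floats); /* stay inside the buffer */
    return packed[off];
}
```

With packed data `{0,1,2, 10,11,12, ...}`, element 1 really starts at float index 3, but the padded stride reads from index 4, so the value picked up as `x` is actually the `y` of that element. This kind of drift is one plausible source of the garbage values described in the question.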
If your host code generates a packed float3 array, with each element taking the size and alignment of just 3 floats, then the proper way to use it from OpenCL is:
- Use a float* parameter instead of float3*
- Load the data using vload3
- Store the data using vstore3
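The three steps above can be sketched in plain C as an emulation of what the kernel would do. This is not OpenCL C and not the asker's code: `vec3`, `load3`, `store3`, `minf` and `packed_min3` are hypothetical names, with `load3` and `store3` mimicking the addressing of `vload3(offset, p)` and `vstore3(data, offset, p)`, which read and write 3 floats at `p + 3 * offset`.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical 3-component value type used only in this sketch. */
typedef struct { float x, y, z; } vec3;

/* Mimics vload3(offset, p): read 3 floats starting at p + 3 * offset. */
vec3 load3(size_t offset, const float *p) {
    vec3 v = { p[3 * offset], p[3 * offset + 1], p[3 * offset + 2] };
    return v;
}

/* Mimics vstore3(data, offset, p): write 3 floats at p + 3 * offset. */
void store3(vec3 v, size_t offset, float *p) {
    p[3 * offset]     = v.x;
    p[3 * offset + 1] = v.y;
    p[3 * offset + 2] = v.z;
}

static float minf(float a, float b) { return a < b ? a : b; }

/* Component-wise minimum over `count` packed elements (3 floats each);
   the result is stored as packed element 0 of `out`. */
void packed_min3(const float *src, size_t count, float *out) {
    assert(count > 0);
    vec3 m = { FLT_MAX, FLT_MAX, FLT_MAX };
    for (size_t i = 0; i < count; i++) {
        vec3 v = load3(i, src); /* packed stride of 3 floats */
        m.x = minf(m.x, v.x);
        m.y = minf(m.y, v.y);
        m.z = minf(m.z, v.z);
    }
    store3(m, 0, out);
}
```

In the actual kernel, the equivalent change would be to declare the buffers as `__global float *` and replace reads such as `src[idx]` with `vload3(idx, src)` and writes such as `gmin[get_group_id(0)] = lmin[0]` with `vstore3(lmin[0], get_group_id(0), gmin)`. The alternative is to keep `float3*` in the kernel and have the host allocate and fill 4 floats per element.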