Optimal workgroup size for sum reduction in OpenCL

Question

I am using the following kernel for sum reduction.

__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    // sdata holds localSize partial sums, so start at half of that (localSize must be a power of two)
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1)
    {
        if(tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example, if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.

I've seen the answers to the general workgroup size question. Are there any recommendations that concern reduction operations specifically?
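The trade-off described in the question can be put into numbers. The following is a minimal sketch of the arithmetic only, assuming every pass reuses the same scheme as the kernel above (each work-group of `local_size` work-items consumes `2 * local_size` inputs and emits one partial sum) and that `local_size` is at least 1. The function names are hypothetical and not part of any OpenCL API.

```c
#include <stddef.h>

/* Partial sums left after one pass over n elements, assuming each
   work-group consumes 2 * local_size inputs and writes one result. */
size_t partial_sums_after_pass(size_t n, size_t local_size)
{
    size_t per_group = 2 * local_size;
    return (n + per_group - 1) / per_group;   /* round up */
}

/* Number of kernel launches needed to get from n elements to a single
   value under the same assumption. Only models the arithmetic; it does
   not account for padding or for finishing small passes on the host. */
unsigned count_reduction_passes(size_t n, size_t local_size)
{
    unsigned passes = 0;
    while (n > 1) {
        n = partial_sums_after_pass(n, local_size);
        ++passes;
    }
    return passes;
}
```

Under this model, 1048576 elements with a local size of 256 leave 2048 partial sums after the first launch and take 3 launches in total, while a local size of 512 leaves 1024 partial sums and takes 2. Fewer launches do not necessarily mean a shorter run time, which is why the sizes still need to be measured on the target device.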

Recommended answer

This question is a possible duplicate of one I answered a while back: "What is the algorithm to determine optimal work group size and number of workgroup".

Experimentation will be the best way to know for sure for any given device.

Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out different values to find the best one.

For each device:

1) Query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

2) Loop over a few multiples and run the kernel with that group size. Save the execution time for each test.

3) When you think you have an optimal value, hard-code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel, because the local array can then be declared inside the kernel with a fixed size.

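//define your own context, device_id, kernel, queue here

cl_int err;
size_t global_size; //set this somewhere to match your test data size (one work-item per two input elements)
size_t preferred_size;
size_t max_group_size;

//both queries write their result through a pointer, so pass the address of each variable
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;

//your vars for hi-res timer go here

for (size_t i=preferred_size ; i<=max_group_size ; i+=preferred_size){
    test_size = i;
    //size the __local sdata buffer (kernel argument 2) to match this group size
    err = clSetKernelArg(kernel, 2, test_size * sizeof(float), NULL);
    //check err
    //reset timer
    //note: on OpenCL 1.x, global_size must be a multiple of test_size or the enqueue fails
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if(err != CL_SUCCESS){
        fail("Unable to enqueue kernel");  //implement your own fail function somewhere..
    }else{
        clFinish(queue); //wait for the kernel to complete before stopping the timer
        //stop timer, save value
        //output timer value and test_size
    }
}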
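The loop above tries every multiple of the preferred size. For this particular kernel, two extra constraints narrow the list. These are assumptions of this sketch, not something the answer states: on OpenCL 1.x the global size has to be divisible by the local size, and a halving loop of the form `s >>= 1` only folds every element into `sdata[0]` when the local size is a power of two. The helper names below are hypothetical. They only filter candidate sizes and pick the fastest measured one, and make no OpenCL calls.

```c
#include <stddef.h>

/* Fill `out` with work-group sizes worth timing: multiples of `preferred`
   up to `max_group` that are powers of two and divide `global_size`.
   Returns the number of candidates written (at most `cap`). */
size_t candidate_group_sizes(size_t preferred, size_t max_group,
                             size_t global_size, size_t *out, size_t cap)
{
    size_t count = 0;
    if (preferred == 0) {
        return 0;                               /* avoid an endless loop */
    }
    for (size_t s = preferred; s <= max_group && count < cap; s += preferred) {
        int is_pow2 = (s & (s - 1)) == 0;       /* exactly one bit set */
        if (is_pow2 && global_size % s == 0) {
            out[count++] = s;
        }
    }
    return count;
}

/* Index of the smallest measured time; returns 0 when count is 0 or 1. */
size_t fastest_index(const double *seconds, size_t count)
{
    size_t best = 0;
    for (size_t i = 1; i < count; ++i) {
        if (seconds[i] < seconds[best]) {
            best = i;
        }
    }
    return best;
}
```

With a preferred multiple of 32, a maximum group size of 256, and 524288 work-items (1048576 elements, two per work-item), the candidates are 32, 64, 128 and 256. After timing each one, `fastest_index` gives the position of the size to hard-code.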

The device-specific kernel can look like this, except the first line should have your optimal value substituted:

#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;
    __local float sdata[LOCAL_SIZE];
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);

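    // sdata holds LOCAL_SIZE partial sums, so start at half of that (LOCAL_SIZE must be a power of two)
    for(unsigned int s = LOCAL_SIZE >> 1; s > 0; s >>= 1){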
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}
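When trying different group sizes, it helps to check each work-group's result against a host-side reference. The following is a minimal sketch of such a check, written in plain C with hypothetical names (`reduce_group_reference`, `MAX_LOCAL`). It emulates one work-group of this reduction scheme sequentially, assuming the local size is a power of two no larger than `MAX_LOCAL`. The tree phase starts at half the local size, which is what is needed for all `local_size` entries of `sdata` to reach `sdata[0]`.

```c
#include <stddef.h>

#define MAX_LOCAL 1024   /* assumed upper bound on the local size */

/* Sequential emulation of one work-group: returns the value that the
   group with index group_id would write to output[group_id]. */
float reduce_group_reference(const float *input, size_t local_size, size_t group_id)
{
    float sdata[MAX_LOCAL];
    size_t base = group_id * local_size;   /* first global id in this group */

    /* Load phase: each work-item adds two neighbouring inputs. */
    for (size_t tid = 0; tid < local_size; ++tid) {
        size_t stride = (base + tid) * 2;
        sdata[tid] = input[stride] + input[stride + 1];
    }

    /* Tree phase: halve the active range on every step. */
    for (size_t s = local_size >> 1; s > 0; s >>= 1) {
        for (size_t tid = 0; tid < s; ++tid) {
            sdata[tid] += sdata[tid + s];
        }
    }
    return sdata[0];
}
```

For an input holding the values 1 to 16 and a local size of 4, group 0 covers the first eight elements and group 1 the last eight. Comparing these values with the device output for each tested size catches a wrong loop bound or an unsupported local size early.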
