Optimal workgroup size for sum reduction in OpenCL
Question
I am using the following kernel for sum reduction.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem: each work item pre-sums two adjacent elements
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem; the tree must start at localSize / 2
    // so the upper half of sdata is folded in (localSize >> 2 would skip it)
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal work-group size, or the number of work groups, when I need more than one work group (for example, to compute the sum of 1048576 elements). As far as I understand, the more work groups I use, the more partial results I get, which also means I will need more global reduction passes at the end.
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?
Answer
This question is a possible duplicate of one I answered a while back: What is the algorithm to determine optimal work group size and number of workgroup.
Experimentation will be the best way to know for sure for any given device.
Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.
For each device:
1) Query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
2) Loop over a few multiples and run the kernel with each group size. Save the execution time for each test.
3) When you think you have an optimal value, hard-code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.
//define your own context, kernel, queue here
int err;
size_t global_size; //set this somewhere to match your test data size
size_t preferred_size;
size_t max_group_size;

//clGetKernelWorkGroupInfo writes into the buffer you pass, so take the address
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;
//your vars for hi-res timer go here
for (size_t i = preferred_size; i <= max_group_size; i += preferred_size) {
    //reset timer
    test_size = i; //note: global_size must be a multiple of test_size
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if (err) {
        fail("Unable to enqueue kernel"); //implement your own fail function somewhere..
    } else {
        clFinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}
The device-specific kernel can look like this, except the first line should have your optimal value substituted:
#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;
    __local float sdata[LOCAL_SIZE];

    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    // start at LOCAL_SIZE / 2 so the upper half of sdata is included
    for(unsigned int s = LOCAL_SIZE >> 1; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}