Why does OpenCL work group size have a huge performance impact on GPU?

Problem description

I am benchmarking a simple matrix transposition kernel on a Qualcomm Adreno 630 GPU, and I am trying to see the impact of different work group sizes, but surprisingly I get some interesting results which I cannot explain. Here is my kernel code:

__kernel void transpose(__global float *input, __global float *output, const int width, const int height) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*height + j] = input[j*width + i];
}

The width and height are both 6400, and the experiment results are as follows (the execution time is the difference between the END and START profiling events):

work group size      execution time
x     y
4    64              24ms
64   4               169ms
256  1               654ms
1    256             34ms
8    32              27ms
1    1024            375ms
1024 1               657ms
32   32              26ms
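
For reference, a minimal host-side sketch of how such a measurement is typically taken with OpenCL event profiling; the variable names and the (8, 32) work group size are illustrative rather than the exact code used, and queue must have been created with CL_QUEUE_PROFILING_ENABLE:

size_t global[2] = {6400, 6400};   // width, height
size_t local[2]  = {8, 32};        // the work group size under test
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);
cl_ulong t_start = 0, t_end = 0;   // profiling timestamps, in nanoseconds
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);
double time_ms = (double)(t_end - t_start) * 1e-6;   // START/END difference in milliseconds
clReleaseEvent(evt);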

After this I did another experiment where I changed the width and height from 6400 to 6401 (and the global work size in the NDRangeKernel call as well), and the result is even more interesting:

work group size      execution time
x     y
4    64              28ms
64   4               105ms
256  1               359ms
1    256             31ms
8    32              32ms
1    1024            99ms
1024 1               358ms
32   32              32ms

The execution time of most scenarios drops significantly. I know memory coalescing or caching could play a role here, but I cannot completely explain this.

Recommended answer

Memory coalescence occurs when consecutive threads access data at consecutive global memory addresses within a 128-byte aligned segment. Then memory accesses are coalesced into one, significantly reducing overall latency.
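
As an illustration of the difference (these two kernels are hypothetical and not part of the question), compare a copy whose accesses coalesce with one whose reads are strided:

__kernel void copy_coalesced(__global const float *in, __global float *out) {
    const int n = get_global_id(0);
    out[n] = in[n];           // consecutive work-items hit consecutive addresses: coalesced
}

__kernel void copy_strided(__global const float *in, __global float *out, const int stride) {
    const int n = get_global_id(0);
    out[n] = in[n*stride];    // addresses are stride elements apart: reads cannot be coalesced
}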

In the 2D range, coalescing only happens along get_global_id(1), the j direction in your case. In the line output[i*height + j] = input[j*width + i];, input[j*width + i] is a misaligned (non-coalesced) read and output[i*height + j] is a coalesced write. Coalesced memory access is generally much faster than misaligned access, but the performance penalty for coalesced/misaligned reads can be vastly different from that for coalesced/misaligned writes. On most desktop GPU architectures, the combination of misaligned read and coalesced write is faster than the other way around, so your implementation should already be the faster variant.

Since coalesced access is only possible along the j index, if you have a range of (x=256, y=1) (i along the x-direction, j along the y-direction), you do not get any coalescing. For (x=8, y=32), j is coalesced in groups of 32, 8 times per thread block, so memory bandwidth is fairly well saturated and performance is good.

If you want maximum possible performance, I'd suggest you go with 1D indexing. This way you have full control over coalescing, and coalescing happens over the entire thread block. Your matrix transpose kernel would then look like this:

#define width 6400
#define height 6400 // the kernel body uses height as well, so it must also be defined
__kernel void transpose(__global float *input, __global float *output) {
    const int n = get_global_id(0);
    int i = n/width;   // row of the output
    int j = n%width;   // column of the output; consecutive n gives consecutive j
    output[i*height + j] = input[j*width + i];
}
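
For reference, a minimal sketch of how this 1D kernel might be enqueued; the work group size of 64 is only an example, and queue/kernel are assumed to already exist:

size_t global = (size_t)6400 * 6400;  // one work-item per matrix element
size_t local  = 64;                   // example work group size; it must divide 6400*6400 evenly
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);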

You can bake width (and height) into the OpenCL C code at C++ runtime, before OpenCL compile time, via string concatenation.
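
One way this baking step might look, assuming C++ host code and that kernel_source holds the transpose kernel text as a std::string (both names are illustrative):

#include <string>
std::string defines = "#define width 6400\n#define height 6400\n";
std::string source  = defines + kernel_source;  // prepend the constants to the kernel text
const char *src     = source.c_str();
cl_program program  = clCreateProgramWithSource(context, 1, &src, NULL, NULL);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

Alternatively, the same constants can be passed as "-Dwidth=6400 -Dheight=6400" in the options string of clBuildProgram.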
