Can this OpenCL code be optimized?
Question
I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return the 1xD vector r where r[i] = c * sum_over_j (v[j] * A[i][j] * B[i][j]).
Below is what I have so far, but it runs freakishly slow. A version without the summing that returns a DxD matrix is about ten times faster. It's called from PyOpenCL, if that makes any difference.

Is anything done wrong? Can it be optimized?
#define D 1000
...
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    int y = get_global_id(1);
    float sum = 0;
    for (int k = 0; k < D; k++)
    {
        sum += vector[k] * matrix[(y * D) + k]
             * matrix2[(y * D) + k];
    }
    result[y] = sum * factor;
}
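Since the kernel is launched from PyOpenCL, a plain NumPy reference of the same formula is handy for checking its output on the host. This is a sketch of my own, not from the original post; the argument names simply mirror the kernel's.

```python
import numpy as np

def element_mult_ref(vector, matrix, matrix2, factor):
    """Host-side reference: r[i] = factor * sum_over_j(v[j] * A[i][j] * B[i][j])."""
    # Element-wise product of the two matrices, then a matrix-vector
    # product, which performs the weighted sum over j for every row i.
    return factor * (matrix * matrix2) @ vector

# Small sanity check against the formula written out longhand.
D = 4
rng = np.random.default_rng(0)
v = rng.random(D).astype(np.float32)
A = rng.random((D, D)).astype(np.float32)
B = rng.random((D, D)).astype(np.float32)
r = element_mult_ref(v, A, B, 2.0)
```

Comparing `r` against the device buffer after the kernel runs (with `np.allclose` and a float32-sized tolerance) makes it easy to tell an optimization bug from a slow-but-correct kernel.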
Cheers!
Answer
Optimization #1: make vector __local.

My first pass at this got a decent improvement in performance. I noticed that each vector[k] is read a total of D times, so I copied it to a __local buffer. This is only possible because D is small enough to fit in local memory. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 GPUs; even the slower GPU is still waiting on memory access.
#define D 1000

__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    int y = get_global_id(0);
    float sum = 0;

    __local float vectCopy[D];
    int ls = get_local_size(0);
    int lid = get_local_id(0);
    // Cooperatively copy the vector into local memory; the guard handles
    // the tail when D is not a multiple of the work-group size.
    for (int i = 0; i < D; i += ls) {
        if (i + lid < D) {
            vectCopy[i + lid] = vector[i + lid];
        }
    }
    // A barrier (not just a mem_fence) is required so every work-item
    // sees the parts of the copy written by the others.
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int k = 0; k < D; k++)
    {
        sum += vectCopy[k] * matrix[(y * D) + k] * matrix2[(y * D) + k];
    }
    result[y] = sum * factor;
}
With this change, the APP profiler shows a new ALU:fetch ratio of 0.20 for the 5870 and 6970 GPUs. Average times changed from 1513 to 1034, and from 1261 to 861, on the same cards. The low-end GPUs are now bound by ALU instead of fetch (greater than a 4:1 ratio).
Optimization #2: calculate each result[y] using an entire work group.

You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by using the work group to compute one element of the result at a time. I defined ls (the local size) to be 64 here because it works on my hardware, as well as most vendors'. The work-group size you use from the host side will have to be 64 unless you change that definition. It needs to be a compile-time constant so the sum[ls] storage can be declared __local, and I don't like passing variable-sized __local vars into my kernels.

Results: 5870 ALU:fetch = 0.59:1, avg = 708; 6970 ALU:fetch = 0.72, avg = 590. According to the APP profiler, this is about twice as fast as your original listing.
#define D 1000
#define ls 64

__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    __local float vectCopy[D];
    int lid = get_local_id(0);
    for (int i = 0; i < D; i += ls) {
        if (i + lid < D) {
            vectCopy[i + lid] = vector[i + lid];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int ng = get_num_groups(0);
    int gid = get_group_id(0);
    int y, k;
    __local float sum[ls];

    for (y = gid; y < D; y += ng) {
        sum[lid] = 0;  // reset this work-item's partial sum for the new row
        for (k = lid; k < D; k += ls)
        {
            sum[lid] += vectCopy[k] * matrix[(y * D) + k] * matrix2[(y * D) + k];
        }
        // All partial sums must be in place before work-item 0 reduces them.
        barrier(CLK_LOCAL_MEM_FENCE);
        if (lid == 0) {
            float total = sum[0];
            for (k = 1; k < ls; k++) {
                total += sum[k];
            }
            result[y] = total * factor;
        }
        // Keep the group together before sum[] is reused for the next row.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
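The strided, work-group-per-row access pattern above can be mimicked in NumPy to convince yourself that the chunked partial sums agree with the direct formula. This emulation is my own sketch, not part of the answer; the name row_reduce_like_workgroup and the default ls are assumptions for illustration.

```python
import numpy as np

def row_reduce_like_workgroup(vector, matrix, matrix2, factor, ls=64):
    """Emulate one work group computing result[y]: 'work-item' lid
    accumulates the strided terms k = lid, lid + ls, lid + 2*ls, ...,
    and the ls partial sums are then reduced, as in the kernel."""
    D = len(vector)
    result = np.zeros(D)
    for y in range(D):
        partial = np.zeros(ls)
        for lid in range(ls):
            ks = np.arange(lid, D, ls)  # strided indices this work-item handles
            partial[lid] = np.sum(vector[ks] * matrix[y, ks] * matrix2[y, ks])
        result[y] = factor * partial.sum()
    return result
```

Because each lid touches consecutive k values across work-items, adjacent work-items read adjacent matrix elements, which is the coalesced access pattern the optimization is after; the emulation only checks the arithmetic, not the memory behavior.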
APP profiler = AMD APP KernelAnalyzer