OpenCL矩阵乘法应该更快吗? [英] OpenCL matrix multiplication should be faster?

查看:100
本文介绍了OpenCL矩阵乘法应该更快吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试学习如何使GPU优化的OpenCL内核,我以在本地内存中使用正方形图块的矩阵乘法为例.但是与numpy.dot()(5 Gflops,它使用的是BLAS)相比,我最多只能获得10倍的加速(〜50 Gflops).

我发现一些研究,其中它们的速度提高了200倍以上(> 1000 Gflops). ftp://ftp.u -aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf 我不知道自己在做什么错,还是因为我的GPU(nvidia GTX 275).还是因为某些pyOpenCl开销.但是我还测量了将结果从GPU复制到RAM所需的时间,这大约是矩阵乘法时间的10%.

#define BLOCK_SIZE 22 
__kernel void matrixMul(
      __global float* Cij, 
      __global float* Aik, 
      __global float* Bkj, 
      __const int ni, 
      __const int nj,
      __const int nk
){
//   WARRNING : interchange of  i  and  j  dimension  lower the performance >2x on my nV GT275 GPU    
int gj = get_global_id(0);    int gi = get_global_id(1); 
int bj = get_group_id(0);     int bi = get_group_id(1);  // Block index
int tj = get_local_id(0);     int ti = get_local_id(1);  // Thread index
int oj = bi*BLOCK_SIZE;       int oi = bj*BLOCK_SIZE; 
float Csub =0; 
__local float As   [BLOCK_SIZE][BLOCK_SIZE];
__local float Bs   [BLOCK_SIZE][BLOCK_SIZE];
for (int ok = 0; ok < nk; ok += BLOCK_SIZE )   {
    As[ti][tj] = Aik[ nk*(gi   ) + tj + ok ];   // A[i][k]
    Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ];        // B[k][j]
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
    barrier(CLK_LOCAL_MEM_FENCE);
}
Cij[ nj * ( gi ) + gj ] = Csub;

}

注意-奇怪的BLOCK_SIZE = 22是最大的BLOCK_SIZE,它确实适合我的GPU上的max work_group_size,即512.在此代码中,必须保持条件BLOCK_SIZE ^ 2<最大work_group_size. 22 = int(sqrt(512)).我也尝试了BLOCK_SIZE = 16或8,但棕褐色为22时较慢.

我还尝试了简单的matrixMul(不使用本地内存),但它甚至比numpy.dot()慢10倍. 我在这里复制了代码 http://gpgpu-computing4.blogspot.cz/2009 /10/matrix-multiplication-3-opencl.html 他们说,即使是简单版本(没有本地内存),其运行速度也应比CPU快200倍?我不理解这一点.

在我看来,性能的依赖点是:

N =  220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464 
N =  330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205 
N =  440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548 
N =  550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217 
N =  660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522 
N =  770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638 
N =  880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152 
N =  990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478 
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625 
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581 
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979  

注意-"Gflops"数字是以N ^ 3/次的形式获得的,它确实包括将结果从GPU复制到主存储器所需的时间,但这一次仅占总时间的百分之几,尤其是对于N> 1000

也许更多的图画是时间上的时间:

N =  220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000 
N =  330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683 
N =  440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565 
N =  550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957 
N =  660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298 
N =  770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638 
N =  880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097 
N =  990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581 
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065 
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581 
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979

我当时认为,使用 async_work_group_copy或read_imageui来将块复制到本地内存. 但是我不明白为什么我使用的代码与那些说200倍加速的人基本相同的代码时为什么会有如此大的差异???

解决方案

即使不看您的代码,我也不会对您的基准做出一些评论.让我们忽略numpy并比较Intel CPU与Nvidia和AMD GPU的最大SP FLOPs/s和DP FLOPs/s.

4 GHz的Intel 2600K可以达到4 GHz *(8 AVX)*(2 ILP)*(4核)= 256 SP GFLOPs/s.对于DP来说是一半:128 DP GFLOPs/s.几周后问世的Haswell会使两者翻倍.英特尔MKL库在GEMM中的效率超过80%.我自己的GEMM代码在i7-2700上获得了70%的利润,因此您用numpy引用的5 GFlops/s很小,无法与之比拟.

我不知道GTX 275的功能如何,但我想它的速度远远超过50 GFLOP/s.

您所引用的文章比较了AMD7970.它们获得848(效率为90%)DP GFlops/s和2646(效率为70%)SP GFlops/s.这比CPU的性能接近10倍,而不是200倍!

您对FLOP的计算是错误的,应该为2.0 * n ^ 3.这仍然是近似值,但在渐近线上是正确的.让我解释一下.

考虑3D点产品.它是x1 * x2 + y1 * y2 + z1 * z2.那是3个乘法和两个加法.因此,N维点积是n个乘法和(n-1)个加法.矩阵乘积等效于nxn点乘积,即n * n * n乘法和n * n *(n-1)个加法.大约是2.0 * n ^ 3 FLOPS.因此,您应该将所有Gflops/s数字加倍.

您可能需要考虑内核时间.自从我使用OpenCL以来已经有一段时间了,但是使用C ++绑定我做了类似的事情

queue = cl::CommandQueue(context, devices[device], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
//other code...run kernel

time_end = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();  
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();

I'm trying to learn how to make GPU optimalized OpenCL kernells, I took example of matrix multiplication using square tiles in local memory. However I got at best case just ~10-times speedup ( ~50 Gflops ) in comparison to numpy.dot() ( 5 Gflops , it is using BLAS).

I found studies where they got speedup >200x ( >1000 Gflops ). ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf I don't know what I'm doing wrong, or if it is just because of my GPU ( nvidia GTX 275 ). Or if it is because of some pyOpenCl overhead. But I meassured also how long does take just to copy result from GPU to RAM and it is just ~10% of the matrix multiplication time.

#define BLOCK_SIZE 22 
__kernel void matrixMul(
      __global float* Cij, 
      __global float* Aik, 
      __global float* Bkj, 
      __const int ni, 
      __const int nj,
      __const int nk
){
//   WARRNING : interchange of  i  and  j  dimension  lower the performance >2x on my nV GT275 GPU    
int gj = get_global_id(0);    int gi = get_global_id(1); 
int bj = get_group_id(0);     int bi = get_group_id(1);  // Block index
int tj = get_local_id(0);     int ti = get_local_id(1);  // Thread index
int oj = bi*BLOCK_SIZE;       int oi = bj*BLOCK_SIZE; 
float Csub =0; 
__local float As   [BLOCK_SIZE][BLOCK_SIZE];
__local float Bs   [BLOCK_SIZE][BLOCK_SIZE];
for (int ok = 0; ok < nk; ok += BLOCK_SIZE )   {
    As[ti][tj] = Aik[ nk*(gi   ) + tj + ok ];   // A[i][k]
    Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ];        // B[k][j]
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
    barrier(CLK_LOCAL_MEM_FENCE);
}
Cij[ nj * ( gi ) + gj ] = Csub;

}

NOTE - the strange BLOCK_SIZE=22 is the maximum BLOCK_SIZE which does fit to max work_group_size which is 512 on my GPU. In this code must hold condition BLOCK_SIZE^2 < max work_group_size. 22=int(sqrt(512)). I tried also BLOCK_SIZE=16 or 8 but it was slower tan with 22.

I also tried simple matrixMul (without using local memory) but it was even 10-times slower than numpy.dot(). I copied the code here http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html they say that even the simple version (without local memory) should run 200x faster than CPU? I don't undrestand that.

the dependecne of performance in my case is:

N =  220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464 
N =  330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205 
N =  440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548 
N =  550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217 
N =  660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522 
N =  770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638 
N =  880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152 
N =  990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478 
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625 
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581 
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979  

NOTE - the "Gflops" number is obtained as N^3/time and it does include time required to copy results from GPU to main memory, but this time is just few percent of total time especially for N>1000

maybe more pictorial is time in secons:

N =  220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000 
N =  330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683 
N =  440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565 
N =  550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957 
N =  660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298 
N =  770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638 
N =  880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097 
N =  990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581 
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065 
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581 
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979

I was thinking that maybe some speed improvement can be obtained using async_work_group_copy or even read_imageui to copy blocks to local memory. But I don't understand why I have so big difference when I'm using basically the same code as people who say they have 200x speedup?????

解决方案

Without even looking at your code let me make some comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.

A Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * ( 4 cores) = 256 SP GFLOPs/s. For DP it's half: 128 DP GFLOPs/s. Haswell which comes out in a few weeks will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700 so the 5 GFlops/s you quote with numpy is tiny and not fair to compare with.

I don't know what the GTX 275 is capable of but I would guess it's much more than 50 GFLOPs/s.

The article you reference compares a AMD 7970. They get 848 (90% efficiency) DP GFlops/s and 2646 (70% efficiency) SP GFlops/s. That's closer to 10x the performance of the CPU not 200x!

Edit: Your calculations of FLOPs is wrong it should be 2.0*n^3. That's still approximate but it's asymptotically true. Let me explain.

Consider a 3D dot product. It's x1*x2+y1*y2+z1*z2. That's 3 multiplications and two additions. So a N-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to nxn dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPS. So you should double all your Gflops/s numbers.

Edit: You might want to consider the kernel time. It's been awhile since I used OpenCL but using the C++ bindings I did something like this

queue = cl::CommandQueue(context, devices[device], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
//other code...run kernel

time_end = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();  
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();

这篇关于OpenCL矩阵乘法应该更快吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆