Improving the performance of Matrix Multiplication


Problem description

This is my code for speeding up matrix multiplication, but it is only 5% faster than the simple one. What can I do to boost it as much as possible?

*The matrices are accessed, for example, as C[sub2ind(i,j,n)] for the C[i, j] position.

/* Row-major index helper; its definition is not shown in the question,
 * so this is assumed from the C[i, j] description above. */
#define sub2ind(i,j,n) ((i)*(n)+(j))

void matrixMultFast(float * const C,            /* output matrix */
                    float const * const A,      /* first matrix */
                    float const * const B,      /* second matrix */
                    int const n,                /* number of rows/cols */
                    int const ib,               /* size of i block (unused here) */
                    int const jb,               /* size of j block */
                    int const kb)               /* size of k block */
{
    int i = 0, j = 0, jj = 0, k = 0, kk = 0;
    float sum;

    /* Zero the output matrix. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[sub2ind(i,j,n)] = 0;

    /* Blocked (tiled) multiplication over k and j. */
    for (kk = 0; kk < n; kk += kb)
    {
        for (jj = 0; jj < n; jj += jb)
        {
            for (i = 0; i < n; i++)
            {
                for (j = jj; j < jj + jb; j++)
                {
                    sum = C[sub2ind(i,j,n)];
                    for (k = kk; k < kk + kb; k++)
                        sum += A[sub2ind(i,k,n)] * B[sub2ind(k,j,n)];
                    C[sub2ind(i,j,n)] = sum;
                }
            }
        }
    }
} // end function 'matrixMultFast'

*It is written in C and needs to support C99.

Answer

There are many, many things you can do to improve the efficiency of matrix multiplication.

To examine how to improve the basic algorithm, let's first take a look at our current options. The naive implementation, of course, has 3 loops and a time complexity of O(n^3). There is another approach, Strassen's algorithm, which achieves an asymptotic speedup with a complexity of about O(n^2.807) (though in practice its constant factors and extra memory traffic often cancel out the gain for realistic sizes).

That is the theory. Now consider how matrices are stored in memory. Row-major is the standard, but you will find column-major too. Depending on the scheme, transposing one of your matrices can improve speed due to fewer cache misses. Matrix multiplication, in essence, is just a bunch of vector dot products, and the same vector is reused by many of them, so it makes sense to keep that vector in cache for faster access.
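To make the cache argument concrete, here is a minimal sketch (function and helper names are my own, not from the question): in the naive loop, the inner loop reads B column-wise with a stride of n floats, while transposing B first makes both operands stream sequentially through memory.

```c
#include <stdlib.h>

/* Row-major index helper, mirroring the sub2ind convention in the question. */
static inline int idx(int i, int j, int n) { return i * n + j; }

/* Naive multiply: the inner loop walks B column-wise, striding n floats
 * per step, which causes a cache miss on nearly every access for large n. */
void mult_naive(float *C, const float *A, const float *B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[idx(i, k, n)] * B[idx(k, j, n)];
            C[idx(i, j, n)] = sum;
        }
}

/* Transpose B once up front; then both operands are read sequentially
 * in the inner loop, so each cache line fetched is fully used. */
void mult_transposed(float *C, const float *A, const float *B, int n)
{
    float *Bt = malloc((size_t)n * n * sizeof *Bt);
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            Bt[idx(j, k, n)] = B[idx(k, j, n)];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[idx(i, k, n)] * Bt[idx(j, k, n)];
            C[idx(i, j, n)] = sum;
        }
    free(Bt);
}
```

The one-time O(n^2) transpose cost is quickly amortised against the O(n^3) multiply.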

Now, with the introduction of multiple cores, parallelism, and caches, the game changes. If we look a little more closely, we see that a dot product is nothing but a bunch of multiplications followed by a summation, and the rows of the result are independent of each other. Hence, we can now look at computing them in parallel.
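As a minimal sketch of that idea (the function name is mine): each row of C depends only on one row of A and all of B, so the outer loop can be split across cores with a single OpenMP pragma. Compile with -fopenmp; without that flag the pragma is simply ignored and the function runs serially with the same result.

```c
/* Each iteration of the outer loop writes a disjoint row of C, so the
 * iterations can safely run on different cores. Loop-local variables
 * (j, k, sum) are automatically private to each thread. */
static inline int rm(int i, int j, int n) { return i * n + j; }

void mult_parallel(float *C, const float *A, const float *B, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[rm(i, k, n)] * B[rm(k, j, n)];
            C[rm(i, j, n)] = sum;
        }
}
```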

现在让我们变得更复杂一些.在谈论矩阵乘法时,单浮点数和双浮点数的大小有所区别.前者通常为32位,而后者为64位(当然,这取决于CPU).每个CPU仅具有固定数量的寄存器,这意味着您的数量越大,可容纳的CPU越少.故事的寓意是,除非您确实需要双重浮点数,否则请坚持使用单个浮点数.

Now let's make things a little more complicated. When talking about matrix multiplication, there is a distinction between single floating point and double floating point in their size. Often the former is 32 bits while the latter, 64 (of course, this depends on the CPU). Each CPU only has a fixed number of registers, meaning the bigger your numbers, the lesser you can fit in the CPU. Moral of the story is, stick to single floating point unless you really need double.

Now that we've gone through the basics of how we can tune matrix multiplication, worry not: you need not implement any of this yourself, since there are already subroutines that do it. As mentioned in the comments, there are GotoBLAS, OpenBLAS, Intel's MKL, and Apple's Accelerate framework. MKL and Accelerate are proprietary, but OpenBLAS is a very competitive alternative.

Here's a nice little example that multiplies two 8k x 8k matrices in a few milliseconds on my Macintosh:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;

typedef float point_t;

point_t* transpose(point_t* A) {    
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));    
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);

    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));       
    /* Split the product into four row blocks of C, one sgemm call each. */
    int step = (SIZE * SIZE / 4);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);

    cblas_sgemm (CblasRowMajor, 
       CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
       1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);      

    return C;
}

void print(point_t* A) {
    int i, j;
    for(i = 0; i < SIZE; i++) {
        for(j = 0; j < SIZE; j++) {
            printf("%f  ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for(; SIZE <= 8192; SIZE *= 2) {
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));

        srand(getpid());

        int i, j;
        for(i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;

        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time = (t2.tv_sec - t1.tv_sec) * 1000.0;      // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0;   // us to ms

        printf("Time taken for %d size matrix multiplication: %lf\n", SIZE, elapsed_time/1000.0);

        free(A);
        free(B);
        free(C);

    }
    return 0;
}

At this point I should also mention SSE (Streaming SIMD Extensions), which is basically something you shouldn't hand-code unless you've worked with assembly before. With SSE you are vectorising your C code to work with small vectors instead of single scalars, which means you operate on blocks of data instead of individual values. Compilers can auto-vectorise simple loops, but for anything complicated you often have to write the intrinsics yourself. Done right, this can speed up your code like nothing before; note that it is a large constant-factor gain (4, 8, or 16 floats per instruction), while the asymptotic complexity stays O(n^3). But SSE is easy to misuse, and most people unfortunately do, making the end result worse than before.
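For a flavour of what that looks like, here is a minimal sketch of an SSE dot product (the function name is mine; it assumes an x86 target with SSE and a length that is a multiple of 4). Each instruction multiplies four float lanes at once, which is exactly the inner loop of a matrix multiply.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two float arrays whose length n is a multiple of 4.
 * Four products are accumulated per iteration in a 128-bit register;
 * the four partial sums are combined at the end. */
float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats of a */
        __m128 vb = _mm_loadu_ps(b + i);   /* load 4 floats of b */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Horizontal sum of the four accumulator lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note the unaligned loads (`_mm_loadu_ps`): with 16-byte-aligned buffers you could use `_mm_load_ps`, which is one of those details that makes hand-written SSE easy to get wrong.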

I hope this motivates you to dig deeper. The world of matrix multiplication is a large and fascinating one. Below, I attach links for further reading.

  1. OpenBLAS
  2. More about SSE
  3. Intel Intrinsics
