Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

I am looking at large matrix multiplication and ran the following experiment to form a baseline test:

  1. Randomly generate two 4096x4096 matrices X, Y from the standard normal distribution (mean 0, stddev 1).
  2. Z = X*Y
  3. Sum elements of Z (to make sure they are accessed) and output.

Here is the naïve C++ implementation:

#include <iostream>
#include <algorithm>
#include <random>

using namespace std;

int main()
{
    constexpr size_t dim = 4096;

    float* x = new float[dim*dim];
    float* y = new float[dim*dim];
    float* z = new float[dim*dim];

    random_device rd;
    mt19937 gen(rd());
    normal_distribution<float> dist(0, 1);

    for (size_t i = 0; i < dim*dim; i++)
    {
        x[i] = dist(gen);
        y[i] = dist(gen);
    }

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
        {
            float acc = 0;

            for (size_t k = 0; k < dim; k++)
                acc += x[row*dim + k] * y[k*dim + col];

            z[row*dim + col] = acc;
        }

    float t = 0;

    for (size_t i = 0; i < dim*dim; i++)
        t += z[i];

    cout << t << endl;

    delete[] x;
    delete[] y;
    delete[] z;
}

Compile and run:

$ g++ -std=gnu++11 -O3 test.cpp
$ time ./a.out

Here is the Octave/matlab implementation:

X = stdnormal_rnd(4096, 4096);
Y = stdnormal_rnd(4096, 4096);
Z = X*Y;
sum(sum(Z))

Run:

$ time octave < test.octave

Octave under the hood is using BLAS (I assume the sgemm function).

The hardware is an i7 3930X on Linux x86-64 with 24 GB of RAM. BLAS appears to be using two cores. Perhaps a hyperthreaded pair?

I found that the C++ version compiled with GCC 4.7 on -O3 took 9 minutes to execute:

real    9m2.126s
user    9m0.302s
sys         0m0.052s

The Octave version took 6 seconds:

real    0m5.985s
user    0m10.881s
sys         0m0.144s

I understand that BLAS is optimized to all hell, and the naïve algorithm is totally ignoring caches and so on, but seriously -- 90 times?

Can anyone explain this difference? What exactly is the architecture of the BLAS implementation? I see it is using Fortran, but what is happening at the CPU level? What algorithm is it using? How is it using the CPU caches? What x86-64 machine instructions does it call? (Is it using advanced CPU features like AVX?) Where does it get this extra speed from?

Which key optimizations to the C++ algorithm could get it on par with the BLAS version?

I ran Octave under gdb and stopped it halfway through the computation a few times. It had started a second thread, and here are the stacks (it looked similar at all stops):

(gdb) thread 1
#0  0x00007ffff6e17148 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffff1626721 in ATL_join_tree () from /usr/lib/libblas.so.3
#2  0x00007ffff1626702 in ATL_join_tree () from /usr/lib/libblas.so.3
#3  0x00007ffff15ae357 in ATL_dptgemm () from /usr/lib/libblas.so.3
#4  0x00007ffff1384b59 in atl_f77wrap_dgemm_ () from /usr/lib/libblas.so.3
#5  0x00007ffff193effa in dgemm_ () from /usr/lib/libblas.so.3
#6  0x00007ffff6049727 in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#7  0x00007ffff6049954 in operator*(Matrix const&, Matrix const&) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#8  0x00007ffff7839e4e in ?? () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#9  0x00007ffff765a93a in do_binary_op(octave_value::binary_op, octave_value const&, octave_value const&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#10 0x00007ffff76c4190 in tree_binary_expression::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#11 0x00007ffff76c33a5 in tree_simple_assignment::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#12 0x00007ffff76d0864 in tree_evaluator::visit_statement(tree_statement&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#13 0x00007ffff76cffae in tree_evaluator::visit_statement_list(tree_statement_list&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#14 0x00007ffff757f6d4 in main_loop() () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#15 0x00007ffff7527abf in octave_main () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1

(gdb) thread 2
#0  0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
(gdb) bt
#0  0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
#1  0x00007ffff15a5fd7 in ATL_dmmIJK2 () from /usr/lib/libblas.so.3
#2  0x00007ffff15a6ae4 in ATL_dmmIJK () from /usr/lib/libblas.so.3
#3  0x00007ffff1518e65 in ATL_dgemm () from /usr/lib/libblas.so.3
#4  0x00007ffff15adf7a in ATL_dptgemm0 () from /usr/lib/libblas.so.3
#5  0x00007ffff6e15e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007ffff6b41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x0000000000000000 in ?? ()

It is calling BLAS gemm as expected.

The first thread appears to be joining the second, so I am not sure whether these two threads account for the 200% CPU usage observed or not.

Which library provides ATL_dgemm in libblas.so.3, and where is its code?

$ ls -al /usr/lib/libblas.so.3
/usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3

$ ls -al /etc/alternatives/libblas.so.3
/etc/alternatives/libblas.so.3 -> /usr/lib/atlas-base/atlas/libblas.so.3

$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3
/usr/lib/atlas-base/atlas/libblas.so.3 -> libblas.so.3.0

$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3.0
/usr/lib/atlas-base/atlas/libblas.so.3.0

$ dpkg -S /usr/lib/atlas-base/atlas/libblas.so.3.0
libatlas3-base: /usr/lib/atlas-base/atlas/libblas.so.3.0

$ apt-get source libatlas3-base

It is ATLAS 3.8.4.

Here are the optimizations I later implemented:

Using a tiled approach where I preload 64x64 blocks of X, Y and Z into separate arrays.

Changing the calculation of each block so that the inner loop looks like this:

for (size_t tcol = 0; tcol < block_width; tcol++)
    bufz[trow][tcol] += B * bufy[tk][tcol];

This allows GCC to autovectorize to SIMD instructions and also allows for instruction level parallelism (I think).

Turning on -march=corei7-avx. This gains 30% extra speed, but feels like cheating because I think the BLAS library is prebuilt.

Here is the code:

#include <iostream>
#include <algorithm>
#include <random>

using namespace std;

constexpr size_t dim = 4096;
constexpr size_t block_width = 64;
constexpr size_t num_blocks = dim / block_width;

double X[dim][dim], Y[dim][dim], Z[dim][dim];

double bufx[block_width][block_width];
double bufy[block_width][block_width];
double bufz[block_width][block_width];

void calc_block()
{
    for (size_t trow = 0; trow < block_width; trow++)
        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bufx[trow][tk];

            for (size_t tcol = 0; tcol < block_width; tcol++)
                bufz[trow][tcol] += B * bufy[tk][tcol];
        }
}

int main()
{
    random_device rd;
    mt19937 gen(rd());
    normal_distribution<double> dist(0, 1);

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
        {
            X[row][col] = dist(gen);
            Y[row][col] = dist(gen);
            Z[row][col] = 0;
        }

    for (size_t block_row = 0; block_row < num_blocks; block_row++)
        for (size_t block_col = 0; block_col < num_blocks; block_col++)
        {
            for (size_t trow = 0; trow < block_width; trow++)
                for (size_t tcol = 0; tcol < block_width; tcol++)
                    bufz[trow][tcol] = 0;

            for (size_t block_k = 0; block_k < num_blocks; block_k++)
            {
                for (size_t trow = 0; trow < block_width; trow++)
                    for (size_t tcol = 0; tcol < block_width; tcol++)
                    {
                        bufx[trow][tcol] = X[block_row*block_width + trow][block_k*block_width + tcol];
                        bufy[trow][tcol] = Y[block_k*block_width + trow][block_col*block_width + tcol];
                    }

                calc_block();
            }

            for (size_t trow = 0; trow < block_width; trow++)
                for (size_t tcol = 0; tcol < block_width; tcol++)
                    Z[block_row*block_width + trow][block_col*block_width + tcol] = bufz[trow][tcol];

        }

    double t = 0;

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
            t += Z[row][col];

    cout << t << endl;
}

All the action is in the calc_block function - over 90% of the time is spent in it.

The new time is:

real    0m17.370s
user    0m17.213s
sys 0m0.092s

Which is much closer.

The disassembly of the calc_block function is as follows:

0000000000401460 <_Z10calc_blockv>:
  401460:   b8 e0 21 60 00          mov    $0x6021e0,%eax
  401465:   41 b8 e0 23 61 00       mov    $0x6123e0,%r8d
  40146b:   31 ff                   xor    %edi,%edi
  40146d:   49 29 c0                sub    %rax,%r8
  401470:   49 8d 34 00             lea    (%r8,%rax,1),%rsi
  401474:   48 89 f9                mov    %rdi,%rcx
  401477:   ba e0 a1 60 00          mov    $0x60a1e0,%edx
  40147c:   48 c1 e1 09             shl    $0x9,%rcx
  401480:   48 81 c1 e0 21 61 00    add    $0x6121e0,%rcx
  401487:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40148e:   00 00 
  401490:   c4 e2 7d 19 01          vbroadcastsd (%rcx),%ymm0
  401495:   48 83 c1 08             add    $0x8,%rcx
  401499:   c5 fd 59 0a             vmulpd (%rdx),%ymm0,%ymm1
  40149d:   c5 f5 58 08             vaddpd (%rax),%ymm1,%ymm1
  4014a1:   c5 fd 29 08             vmovapd %ymm1,(%rax)
  4014a5:   c5 fd 59 4a 20          vmulpd 0x20(%rdx),%ymm0,%ymm1
  4014aa:   c5 f5 58 48 20          vaddpd 0x20(%rax),%ymm1,%ymm1
  4014af:   c5 fd 29 48 20          vmovapd %ymm1,0x20(%rax)
  4014b4:   c5 fd 59 4a 40          vmulpd 0x40(%rdx),%ymm0,%ymm1
  4014b9:   c5 f5 58 48 40          vaddpd 0x40(%rax),%ymm1,%ymm1
  4014be:   c5 fd 29 48 40          vmovapd %ymm1,0x40(%rax)
  4014c3:   c5 fd 59 4a 60          vmulpd 0x60(%rdx),%ymm0,%ymm1
  4014c8:   c5 f5 58 48 60          vaddpd 0x60(%rax),%ymm1,%ymm1
  4014cd:   c5 fd 29 48 60          vmovapd %ymm1,0x60(%rax)
  4014d2:   c5 fd 59 8a 80 00 00    vmulpd 0x80(%rdx),%ymm0,%ymm1
  4014d9:   00 
  4014da:   c5 f5 58 88 80 00 00    vaddpd 0x80(%rax),%ymm1,%ymm1
  4014e1:   00 
  4014e2:   c5 fd 29 88 80 00 00    vmovapd %ymm1,0x80(%rax)
  4014e9:   00 
  4014ea:   c5 fd 59 8a a0 00 00    vmulpd 0xa0(%rdx),%ymm0,%ymm1
  4014f1:   00 
  4014f2:   c5 f5 58 88 a0 00 00    vaddpd 0xa0(%rax),%ymm1,%ymm1
  4014f9:   00 
  4014fa:   c5 fd 29 88 a0 00 00    vmovapd %ymm1,0xa0(%rax)
  401501:   00 
  401502:   c5 fd 59 8a c0 00 00    vmulpd 0xc0(%rdx),%ymm0,%ymm1
  401509:   00 
  40150a:   c5 f5 58 88 c0 00 00    vaddpd 0xc0(%rax),%ymm1,%ymm1
  401511:   00 
  401512:   c5 fd 29 88 c0 00 00    vmovapd %ymm1,0xc0(%rax)
  401519:   00 
  40151a:   c5 fd 59 8a e0 00 00    vmulpd 0xe0(%rdx),%ymm0,%ymm1
  401521:   00 
  401522:   c5 f5 58 88 e0 00 00    vaddpd 0xe0(%rax),%ymm1,%ymm1
  401529:   00 
  40152a:   c5 fd 29 88 e0 00 00    vmovapd %ymm1,0xe0(%rax)
  401531:   00 
  401532:   c5 fd 59 8a 00 01 00    vmulpd 0x100(%rdx),%ymm0,%ymm1
  401539:   00 
  40153a:   c5 f5 58 88 00 01 00    vaddpd 0x100(%rax),%ymm1,%ymm1
  401541:   00 
  401542:   c5 fd 29 88 00 01 00    vmovapd %ymm1,0x100(%rax)
  401549:   00 
  40154a:   c5 fd 59 8a 20 01 00    vmulpd 0x120(%rdx),%ymm0,%ymm1
  401551:   00 
  401552:   c5 f5 58 88 20 01 00    vaddpd 0x120(%rax),%ymm1,%ymm1
  401559:   00 
  40155a:   c5 fd 29 88 20 01 00    vmovapd %ymm1,0x120(%rax)
  401561:   00 
  401562:   c5 fd 59 8a 40 01 00    vmulpd 0x140(%rdx),%ymm0,%ymm1
  401569:   00 
  40156a:   c5 f5 58 88 40 01 00    vaddpd 0x140(%rax),%ymm1,%ymm1
  401571:   00 
  401572:   c5 fd 29 88 40 01 00    vmovapd %ymm1,0x140(%rax)
  401579:   00 
  40157a:   c5 fd 59 8a 60 01 00    vmulpd 0x160(%rdx),%ymm0,%ymm1
  401581:   00 
  401582:   c5 f5 58 88 60 01 00    vaddpd 0x160(%rax),%ymm1,%ymm1
  401589:   00 
  40158a:   c5 fd 29 88 60 01 00    vmovapd %ymm1,0x160(%rax)
  401591:   00 
  401592:   c5 fd 59 8a 80 01 00    vmulpd 0x180(%rdx),%ymm0,%ymm1
  401599:   00 
  40159a:   c5 f5 58 88 80 01 00    vaddpd 0x180(%rax),%ymm1,%ymm1
  4015a1:   00 
  4015a2:   c5 fd 29 88 80 01 00    vmovapd %ymm1,0x180(%rax)
  4015a9:   00 
  4015aa:   c5 fd 59 8a a0 01 00    vmulpd 0x1a0(%rdx),%ymm0,%ymm1
  4015b1:   00 
  4015b2:   c5 f5 58 88 a0 01 00    vaddpd 0x1a0(%rax),%ymm1,%ymm1
  4015b9:   00 
  4015ba:   c5 fd 29 88 a0 01 00    vmovapd %ymm1,0x1a0(%rax)
  4015c1:   00 
  4015c2:   c5 fd 59 8a c0 01 00    vmulpd 0x1c0(%rdx),%ymm0,%ymm1
  4015c9:   00 
  4015ca:   c5 f5 58 88 c0 01 00    vaddpd 0x1c0(%rax),%ymm1,%ymm1
  4015d1:   00 
  4015d2:   c5 fd 29 88 c0 01 00    vmovapd %ymm1,0x1c0(%rax)
  4015d9:   00 
  4015da:   c5 fd 59 82 e0 01 00    vmulpd 0x1e0(%rdx),%ymm0,%ymm0
  4015e1:   00 
  4015e2:   c5 fd 58 80 e0 01 00    vaddpd 0x1e0(%rax),%ymm0,%ymm0
  4015e9:   00 
  4015ea:   48 81 c2 00 02 00 00    add    $0x200,%rdx
  4015f1:   48 39 ce                cmp    %rcx,%rsi
  4015f4:   c5 fd 29 80 e0 01 00    vmovapd %ymm0,0x1e0(%rax)
  4015fb:   00 
  4015fc:   0f 85 8e fe ff ff       jne    401490 <_Z10calc_blockv+0x30>
  401602:   48 83 c7 01             add    $0x1,%rdi
  401606:   48 05 00 02 00 00       add    $0x200,%rax
  40160c:   48 83 ff 40             cmp    $0x40,%rdi
  401610:   0f 85 5a fe ff ff       jne    401470 <_Z10calc_blockv+0x10>
  401616:   c5 f8 77                vzeroupper 
  401619:   c3                      retq   
  40161a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

Solution

Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).

In your inner loop, on k, you have y[k*dim + col]. Because of the way memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set. The way cache is structured, each memory address has one cache set where its contents must be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular cache set.
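To make that concrete with illustrative numbers (assume a 32 KiB, 8-way set-associative L1 data cache with 64-byte lines, so 64 sets): consecutive values of k in y[k*dim + col] touch addresses 4096 × 4 = 16384 bytes apart. That stride spans 16384 / 64 = 256 cache lines, and 256 mod 64 = 0, so every access in the column lands in the same set, which has only 8 lines to hold them.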

Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from cache, so they must be reloaded again.

Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.

High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes, and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before they go on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
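For the copying idea, here is a minimal sketch (copy_column_to_row is an illustrative name, not from any real BLAS): it pays the strided, cache-hostile reads once, after which later passes can stream over the contiguous buffer.

void copy_column_to_row(const float* m, size_t dim, size_t col, float* buf)
{
    // m is a row-major dim-by-dim matrix, so walking down one column
    // touches addresses dim*sizeof(float) bytes apart (cache-hostile).
    // buf receives the column as a contiguous (cache-friendly) row.
    for (size_t k = 0; k < dim; k++)
        buf[k] = m[k*dim + col];
}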

Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.
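For illustration, x86 compilers expose AVX through intrinsics in <immintrin.h>; this sketch (fma8 is a made-up name, and it assumes the CPU supports AVX) multiplies and accumulates eight float pairs with one instruction per step:

#include <immintrin.h>

// acc[i] += a[i] * b[i] for i = 0..7, eight lanes per instruction.
void fma8(const float* a, const float* b, float* acc)
{
    __m256 va = _mm256_loadu_ps(a);    // load 8 floats from a
    __m256 vb = _mm256_loadu_ps(b);    // load 8 floats from b
    __m256 vs = _mm256_loadu_ps(acc);  // load 8 running sums
    vs = _mm256_add_ps(vs, _mm256_mul_ps(va, vb)); // 8 multiplies, 8 adds
    _mm256_storeu_ps(acc, vs);         // store 8 sums back
}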

These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.

Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today’s processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
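As a sketch of what that looks like in your original inner loop (assuming dim is a multiple of 4; the cache-hostile access to y is deliberately left alone, since this shows only the dependence-breaking):

float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

// Four independent chains: the add feeding acc1 need not wait for acc0.
for (size_t k = 0; k < dim; k += 4)
{
    acc0 += x[row*dim + k + 0] * y[(k + 0)*dim + col];
    acc1 += x[row*dim + k + 1] * y[(k + 1)*dim + col];
    acc2 += x[row*dim + k + 2] * y[(k + 2)*dim + col];
    acc3 += x[row*dim + k + 3] * y[(k + 3)*dim + col];
}

z[row*dim + col] = (acc0 + acc1) + (acc2 + acc3);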

It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.

Here is a very crude estimate of the effect of Strassen's algorithm, based on this Wikipedia page. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3−2.807) ≈ 2.05). Again, that is just a ballpark estimate.

As for the later optimizations, consider this code in the inner loop:

bufz[trow][tcol] += B * bufy[tk][tcol];

One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.

Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
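A sketch of how the blocked code above could state both properties (using GCC's __restrict__ extension, since standard C++ has no restrict keyword; this is a drop-in variant of calc_block, called as calc_block(bufz, bufx, bufy)):

// static: cannot be called from other modules, so the compiler may assume
// it sees every call site.
// __restrict__: a promise that the three blocks never overlap, freeing the
// compiler to vectorize and reorder the loads and stores.
static void calc_block(double (* __restrict__ bz)[block_width],
                       double (* __restrict__ bx)[block_width],
                       double (* __restrict__ by)[block_width])
{
    for (size_t trow = 0; trow < block_width; trow++)
        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bx[trow][tk];

            for (size_t tcol = 0; tcol < block_width; tcol++)
                bz[trow][tcol] += B * by[tk][tcol];
        }
}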

Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1 repeatedly. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.

One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.
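One untested sketch of that rewrite for calc_block: interchange the loops so tk is innermost and hold a strip of four sums in locals, so each bufz element is read and written once per block rather than once per tk (it also creates four independent chains; assumes block_width is a multiple of 4):

for (size_t trow = 0; trow < block_width; trow++)
    for (size_t tcol = 0; tcol < block_width; tcol += 4)
    {
        // Partial sums live in locals (registers), not in memory.
        double s0 = bufz[trow][tcol + 0], s1 = bufz[trow][tcol + 1];
        double s2 = bufz[trow][tcol + 2], s3 = bufz[trow][tcol + 3];

        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bufx[trow][tk];
            s0 += B * bufy[tk][tcol + 0];
            s1 += B * bufy[tk][tcol + 1];
            s2 += B * bufy[tk][tcol + 2];
            s3 += B * bufy[tk][tcol + 3];
        }

        bufz[trow][tcol + 0] = s0; bufz[trow][tcol + 1] = s1;
        bufz[trow][tcol + 2] = s2; bufz[trow][tcol + 3] = s3;
    }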

Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.

That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).

When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.

Also, does Strassen’s algorithm account for any of the remaining difference?
