Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD


Problem Description

The Problem


I am studying high performance matrix multiplication algorithms such as OpenBLAS or GotoBLAS and I am trying to reproduce some of the results. This question deals with the inner kernel of a matrix multiplication algorithm. Specifically, I am looking at computing C += AB, where A and B are 2x2 matrices of type double, at the peak speed of my CPU. There are two ways to do this. One way is to use SIMD intrinsics. The second way is to code directly in assembly using the SIMD registers.

What I have looked at so far


All the relevant papers and course webpages, and many, many SO Q&As dealing with the subject (too many to list). I have compiled OpenBLAS on my computer and looked through the OpenBLAS, GotoBLAS, and BLIS source codes, as well as Agner Fog's manuals.

Hardware


My CPU is an Intel Core i5-540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (Westmere), so it can theoretically compute 4 double precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of (2.533 GHz + 4*0.133 GHz) * (4 DP flops/core/cycle) * (1 core) = 12.27 DP Gflops. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed-up and I should get a theoretical peak of 22.4 DP Gflops.
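
The peak figures above are just a product of clock rate, flops per cycle, and core count. A minimal sketch of the arithmetic in C, using the numbers quoted in the paragraph above (the program and its variable names are only illustrative):

#include <stdio.h>

int main(void) {
    double base_ghz       = 2.533;  // nominal clock of the i5-540M
    double turbo_step_ghz = 0.133;  // one Turbo Boost bin
    double dp_flops_cycle = 4.0;    // Nehalem/Westmere: 2-wide SSE add + 2-wide SSE mul per cycle

    double one_core  = (base_ghz + 4*turbo_step_ghz) * dp_flops_cycle * 1; // 4 turbo bins, 1 core
    double two_cores = (base_ghz + 2*turbo_step_ghz) * dp_flops_cycle * 2; // 2 turbo bins, 2 cores

    printf("1-core peak: %.2f DP Gflops\n", one_core);   // ~12.3
    printf("2-core peak: %.2f DP Gflops\n", two_cores);  // ~22.4
    return 0;
}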

Setup


I declare my 2x2 matrices as double and initialize them with random entries as shown in the code snippet below.

srand(time(NULL));
const int n = 2;
double A[n*n];
double B[n*n];
double C[n*n];
double T[n*n];
for(int i = 0; i < n*n; i++){
    A[i] = (double) rand()/RAND_MAX;
    B[i] = (double) rand()/RAND_MAX;
    C[i] = 0.0;
    T[i] = 0.0; // T accumulates the reference result below, so it must also start at zero
}


I compute a true answer using naive matrix-matrix multiplication (shown below), which allows me to check my result either visually or by computing the L2 norm of the difference of all the elements.

// "true" answer
for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
        for(int k = 0; k < n; k++)
            T[i*n + j] += A[i*n + k]*B[k*n + j];


To run the code and get an estimate of the Gflops, I call each multiplication function once to warm up, and then execute it inside a for loop maxiter times, making sure to zero the C matrix each time since I am computing C += AB. The for loop is placed between two clock() statements, and this is used to estimate the Gflops. The code snippet below illustrates this part.

C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
mult2by2(A,B,C); //warmup
time1 = clock();
for(int i = 0; i < maxiter; i++){
        mult2by2(A,B,C);
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
}
time2 = clock() - time1;
time3 = (double)(time2)/CLOCKS_PER_SEC;
gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
mult2by2(A,B,C); // to compute the norm against T
norm = L2norm(n,C,T);
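
The L2norm helper called above is not listed in the question; the name and call signature are taken from the snippet, and the body below is only a guess at what it presumably computes (the root of the summed squared differences between C and T):

#include <math.h>

// L2 norm of the element-wise difference between two n x n matrices
double L2norm(int n, const double* C, const double* T) {
    double sum = 0.0;
    for (int i = 0; i < n*n; i++) {
        double d = C[i] - T[i];
        sum += d*d;
    }
    return sqrt(sum);
}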

SIMD code


My CPU supports 128-bit vectors, so I can fit 2 doubles in each vector. This is the main reason why I am doing 2x2 matrix multiplication in the inner kernel. The SIMD code computes an entire row of C at one time.

    inline void 
    __attribute__ ((gnu_inline))        
    __attribute__ ((aligned(16))) mult2by2B(        
            const double* restrict A,
            const double* restrict B,
            double* restrict C
        )

    {

    register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
    // row 0 of C:  C[0..1] += A[0]*B[0..1] + A[1]*B[2..3]
    xmm0 = _mm_load_pd(C);        // C[0..1]
    xmm1 = _mm_load1_pd(A);       // broadcast A[0]
    xmm2 = _mm_load_pd(B);        // B[0..1] (row 0 of B)
    xmm3 = _mm_load1_pd(A + 1);   // broadcast A[1]
    xmm4 = _mm_load_pd(B + 2);    // B[2..3] (row 1 of B)
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C,xmm2);

    // row 1 of C:  C[2..3] += A[2]*B[0..1] + A[3]*B[2..3]
    xmm0 = _mm_load_pd(C + 2);
    xmm1 = _mm_load1_pd(A + 2);   // broadcast A[2]
    xmm2 = _mm_load_pd(B);        // B[0..1] again
    xmm3 = _mm_load1_pd(A + 3);   // broadcast A[3]
    //xmm4 = _mm_load_pd(B + 2);  // not needed: xmm4 still holds B[2..3]
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C + 2,xmm2);
}
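
For reference, here is the scalar arithmetic that this kernel performs, written out in plain C (the function name mult2by2_scalar is mine, purely for illustration):

// C += A*B for row-major 2x2 matrices, one row of C per pair of lines,
// mirroring what mult2by2B computes two elements at a time.
static inline void mult2by2_scalar(const double* A, const double* B, double* C) {
    C[0] += A[0]*B[0] + A[1]*B[2];   // row 0 of C
    C[1] += A[0]*B[1] + A[1]*B[3];
    C[2] += A[2]*B[0] + A[3]*B[2];   // row 1 of C
    C[3] += A[2]*B[1] + A[3]*B[3];
}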

Assembly (Intel syntax)


My first attempt was to create a separate assembly routine for this part and call it from the main routine. However, it was extremely slow because I cannot inline extern functions. I therefore wrote the assembly as inline assembly, as shown below. It is identical to what is produced by gcc -S -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel. From what I understand of the Nehalem microarchitecture diagrams, this processor can perform SSE ADD, SSE MUL, and SSE MOV in parallel, which explains the interleaving of the MUL, ADD, and MOV instructions. You will notice the SIMD instructions above are in a different order because I had a different understanding from Agner Fog's manuals. Nevertheless, gcc is smart, and the SIMD code above compiles down to the same assembly shown in the inline version.

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(16))) mult2by2A
    (   
        const double* restrict A,
        const double* restrict B,
        double* restrict C
    )
    {
    __asm__ __volatile__
    (
    "mov        edx, %[A]                   \n\t"
    "mov        ecx, %[B]                   \n\t"
    "mov        eax, %[C]                   \n\t"
    "movapd     xmm3, XMMWORD PTR [ecx]     \n\t"
    "movapd     xmm2, XMMWORD PTR [ecx+16]  \n\t"
    "movddup    xmm1, QWORD PTR [edx]       \n\t"
    "mulpd      xmm1, xmm3                  \n\t"
    "addpd      xmm1, XMMWORD PTR [eax]     \n\t"
    "movddup    xmm0, QWORD PTR [edx+8]     \n\t"
    "mulpd      xmm0, xmm2                  \n\t"
    "addpd      xmm0, xmm1                  \n\t"
    "movapd     XMMWORD PTR [eax], xmm0     \n\t"
    "movddup    xmm4, QWORD PTR [edx+16]    \n\t"
    "mulpd      xmm4, xmm3                  \n\t"
    "addpd      xmm4, XMMWORD PTR [eax+16]  \n\t"
    "movddup    xmm5, QWORD PTR [edx+24]    \n\t"
    "mulpd      xmm5, xmm2                  \n\t"
    "addpd      xmm5, xmm4                  \n\t"
    "movapd     XMMWORD PTR [eax+16], xmm5  \n\t"
    : // no outputs 
    : // inputs
    [A] "m" (A),
    [B] "m" (B), 
    [C] "m" (C)
    : //register clobber
    "memory",
    "edx","ecx","eax",
    "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5"
    );
}

Results


I compile my code with the following flags:

gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel

The results for maxiter = 1 billion are below:

********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 9.563000, Avg. Gflops: 1.673115

********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 0.359000, Avg. Gflops: 44.568245


If I force the SIMD version to not be inlined with __attribute__ ((noinline)), the results are:

********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 11.155000, Avg. Gflops: 1.434334

********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 11.264000, Avg. Gflops: 1.420455

Questions


1. If the inline ASM and SIMD implementations both produce identical assembly output, why is the assembly version so much slower? It is as if the inline assembly did not get inlined, which is made evident by the second set of results, where the inlined ASM shows the same performance as the noinline SIMD. The only explanation I can find is in Agner Fog's Volume 2, page 6:


Compiled code may be faster than assembly code because compilers can make inter-procedural optimization and whole-program optimization. The assembly programmer usually has to make well-defined functions with a well-defined call interface that obeys all calling conventions in order to make the code testable and verifiable. This prevents many of the optimization methods that compilers use, such as function inlining, register allocation, constant propagation, common subexpression elimination across functions, scheduling across functions, etc. These advantages can be obtained by using C++ code with intrinsic functions instead of assembly code.


But the assembler output for both versions is exactly the same.


2. Why am I seeing 44 Gflops in the first set of results? This is way above the 12 Gflops peak I calculated, and is what I would expect if I were running both cores with single precision calculations.

EDIT 1

A comment suggested there might be dead code elimination. I can confirm that this is happening for the SIMD instructions: the -S output shows that the for loop for the SIMD version only zeros the C matrix. I can disable this by turning off compiler optimization with -O0. In that case, SIMD runs 3x slower than ASM, but ASM still runs at exactly the same speed. The norm is also nonzero now, but it is still OK at 10^-16. I also see that the inline ASM version is being inlined (it appears between APP and NO_APP tags), but it is also being unrolled 8 times in the for loop. I think unrolling that many times will impact performance heavily, as I usually unroll loops 4 times; anything more, in my experience, seems to degrade performance.

Answer


GCC is optimizing away your inline function that uses intrinsics, mult2by2B, due to the line

C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Without that line it takes 2.9 seconds on the computer from Coliru: http://coliru.stacked-crooked.com/a/992304f5f672e257


And with the line it only takes 0.000001 seconds: http://coliru.stacked-crooked.com/a/9722c39bb6b8590a

您还可以看到这在装配。如果您删除code以下,进入 http://gcc.godbolt.org/ 你会看到,与该行code它完全跳过功能。

You can also see this in the assembly. If you drop the code below into http://gcc.godbolt.org/ you will see that with that line of code it skips the function entirely.

但是,当您内联汇编GCC不优化功能, mult2by2A ,走(即使它内联的话)。您可以在装配看到这一点。

However, when you inline the assembly GCC is NOT optimizing the function, mult2by2A, away (even though it inlines it). You can see this in the assembly as well.

#include <stdio.h>
#include <emmintrin.h>                 // SSE2
#include <omp.h>

inline void 
    __attribute__ ((gnu_inline))        
    __attribute__ ((aligned(16))) mult2by2B(        
            const double* __restrict A,
            const double* __restrict B,
            double* __restrict C
        )

    {

    register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
    xmm0 = _mm_load_pd(C);
    xmm1 = _mm_load1_pd(A);
    xmm2 = _mm_load_pd(B);
    xmm3 = _mm_load1_pd(A + 1);
    xmm4 = _mm_load_pd(B + 2);
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C,xmm2);

    xmm0 = _mm_load_pd(C + 2);
    xmm1 = _mm_load1_pd(A + 2);
    xmm2 = _mm_load_pd(B);
    xmm3 = _mm_load1_pd(A + 3);
    //xmm4 = _mm_load_pd(B + 2);
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C + 2,xmm2);
}

int main() {
  double A[4], B[4], C[4];
  int maxiter = 10000000;
  //int maxiter = 1000000000;
  double dtime;
  dtime = omp_get_wtime();
  for(int i = 0; i < maxiter; i++){
        mult2by2B(A,B,C);
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
  }
  dtime = omp_get_wtime() - dtime;
  printf("%f %f %f %f\n", C[0], C[1], C[2], C[3]);
  //gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
  printf("time %f\n", dtime);
}
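
As an aside (this is not part of the original answer), one common way to keep such a loop from being eliminated while still zeroing C each iteration is to hide the result behind an empty asm statement that the compiler must treat as a use of the array. A minimal self-contained sketch, assuming GCC extended asm, with a scalar kernel standing in for mult2by2B:

#include <stdio.h>

// Pretend this empty asm may read or write *p: the compiler then has to keep
// every store to the pointed-to data that happens before the call.
static inline void escape(void *p) {
    __asm__ __volatile__("" : : "g"(p) : "memory");
}

int main(void) {
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};
    for (int i = 0; i < 10000000; i++) {
        // C += A*B written out in scalar form, standing in for mult2by2B
        C[0] += A[0]*B[0] + A[1]*B[2];
        C[1] += A[0]*B[1] + A[1]*B[3];
        C[2] += A[2]*B[0] + A[3]*B[2];
        C[3] += A[2]*B[1] + A[3]*B[3];
        escape(C);                        // the result is "used" every pass
        C[0] = C[1] = C[2] = C[3] = 0.0;  // reset, as in the question
    }
    printf("%f %f %f %f\n", C[0], C[1], C[2], C[3]);
    return 0;
}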

