Replicating BLAS matrix multiplication performance: Can I match it?

Problem Description

Background

If you have been following my posts, I am attempting to replicate the results found in Kazushige Goto's seminal paper on square matrix multiplication C = AB. My last post regarding this topic can be found here: http://stackoverflow.com/questions/23964334/cant-get-over-50-max-theoretical-performance-on-matrix-multiply. In that version of my code, I follow the memory layering and packing strategy of Goto with an inner kernel computing 2x8 blocks of C using 128-bit SSE3 intrinsics. My CPU is an i5-540M with hyperthreading off. Additional info about my hardware can be found in another post (http://stackoverflow.com/questions/23790842/optimized-2x2-matrix-multiplication-slow-assembly-versus-fast-simd) and is repeated below.

My Hardware

My CPU is an Intel i5-540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (Westmere), so it can theoretically compute 4 double precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of ( 2.533 GHz + 4*0.133 GHz ) * ( 4 DP flops/core/cycle ) * ( 1 core ) = 12.27 DP Gflops. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed-up and I should get a theoretical peak of 22.4 DP Gflops.
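
As a quick cross-check of that arithmetic, here is a trivial sketch of the peak calculation (the clock speed, turbo steps, and flops-per-cycle value are just the numbers quoted above; nothing here is measured):

#include <stdio.h>

int main(void)
{
    // Numbers quoted above for the i5-540M (Westmere):
    const double base_ghz   = 2.533;   // nominal clock
    const double turbo_step = 0.133;   // one Turbo Boost bin
    const double dp_flops_per_cycle = 4.0; // 2-wide SSE multiply + 2-wide SSE add

    // One core active: 4 turbo bins are available.
    double peak_1core = (base_ghz + 4*turbo_step) * dp_flops_per_cycle * 1;
    // Both cores active: only 2 turbo bins.
    double peak_2core = (base_ghz + 2*turbo_step) * dp_flops_per_cycle * 2;

    printf("1-core peak: %.2f DP Gflops\n", peak_1core); // ~12.26
    printf("2-core peak: %.2f DP Gflops\n", peak_2core); // ~22.39
    return 0;
}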

My Software

Windows7 64 bit, but MinGW/GCC 32 bit due to restrictions on my computer.

What's new this time?

  • I compute 2x4 blocks of C. This gives better performance and is in line with what Goto says (half the registers should be used to compute C). I have tried many sizes: 1x8, 2x8, 2x4, 4x2, 2x2, 4x4.
  • My inner kernel is hand-coded x86 assembly, optimized to the best of my ability (it matches some of the kernels that Goto wrote), which gives a rather large performance boost over the SSE-intrinsics version. This code is unrolled 8 times inside the inner kernel (defined as a macro for convenience), giving the best performance out of the unrolling strategies I tried (a simplified intrinsics sketch of this 2x4 kernel follows this list).
  • I use the Windows performance counters to time the codes, rather than clock().
  • I time the inner kernel independently of the total code, to see how well my hand-coded assembly is doing.
  • I report the best result from some number of trials, rather than an average over the trials.
  • No more OpenMP (single core performance only).
  • NOTE I will recompile OpenBLAS tonight to use only 1 core so I can compare.
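
For readers who want to see the register blocking without parsing the assembly below, here is a simplified SSE3-intrinsics analogue of the 2x4 micro-kernel. It is only a sketch: the function name and the Ap/Bp arguments are mine, it assumes the packed layouts produced by rpack (mr = 2 doubles per k) and cpack (nr = 4 doubles per k), and it omits the 8x unrolling and instruction scheduling of the hand-written assembly.

#include <pmmintrin.h>  // SSE3 (for _mm_loaddup_pd); the SSE2 ops come with it

// Sketch only: computes C(0..1, 0..3) += A(0..1, :) * B(:, 0..3) for one
// 2x4 block. Ap is A packed 2 doubles per k (as rpack produces), Bp is B
// packed 4 doubles per k (as cpack produces), C is row-major with row
// stride rs_c doubles; packed buffers and C rows are assumed 16-byte aligned.
static void micro_kernel_2x4(const double *Ap, const double *Bp,
                             double *C, int rs_c, int kc)
{
    __m128d c00 = _mm_setzero_pd(), c01 = _mm_setzero_pd(); // row 0: cols 0-1, 2-3
    __m128d c10 = _mm_setzero_pd(), c11 = _mm_setzero_pd(); // row 1: cols 0-1, 2-3

    for (int k = 0; k < kc; ++k) {
        __m128d a0  = _mm_loaddup_pd(Ap + 2*k);     // broadcast A(0,k)
        __m128d a1  = _mm_loaddup_pd(Ap + 2*k + 1); // broadcast A(1,k)
        __m128d b01 = _mm_load_pd(Bp + 4*k);        // B(k, 0..1)
        __m128d b23 = _mm_load_pd(Bp + 4*k + 2);    // B(k, 2..3)

        c00 = _mm_add_pd(c00, _mm_mul_pd(a0, b01));
        c01 = _mm_add_pd(c01, _mm_mul_pd(a0, b23));
        c10 = _mm_add_pd(c10, _mm_mul_pd(a1, b01));
        c11 = _mm_add_pd(c11, _mm_mul_pd(a1, b23));
    }

    // Accumulate into C, like the addpd/movapd epilogue of the asm kernel.
    _mm_store_pd(C,            _mm_add_pd(_mm_load_pd(C),            c00));
    _mm_store_pd(C + 2,        _mm_add_pd(_mm_load_pd(C + 2),        c01));
    _mm_store_pd(C + rs_c,     _mm_add_pd(_mm_load_pd(C + rs_c),     c10));
    _mm_store_pd(C + rs_c + 2, _mm_add_pd(_mm_load_pd(C + rs_c + 2), c11));
}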

Some Preliminary Results

N is the dimension of the square matrices, Total Gflops/s is the Gflops/s of the entire code, and Kernel Gflops/s is the Gflops/s of the inner kernel. You can see that with a 12.26 Gflops/s peak on one core, the inner kernel is getting around 75% efficiency, while the overall code is about 60% efficient.

I would like to get closer to 95% efficiency for the kernel and 80% for the overall code. What else can I do to improve the performance of at least the inner kernel?

       N   Total Gflops/s  Kernel Gflops/s
     256         7.778089         9.756284
     512         7.308523         9.462700
     768         7.223283         9.253639
    1024         7.197375         9.132235
    1280         7.142538         8.974122
    1536         7.114665         8.967249
    1792         7.060789         8.861958
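
As a quick sanity check on those percentages, using the N = 1024 row above and the 12.26 Gflops/s peak:

#include <stdio.h>

int main(void)
{
    const double peak        = 12.26;     // 1-core DP Gflops/s peak from above
    const double total_1024  = 7.197375;  // Total Gflops/s at N = 1024
    const double kernel_1024 = 9.132235;  // Kernel Gflops/s at N = 1024

    printf("kernel efficiency: %.1f%%\n", 100.0 * kernel_1024 / peak); // ~74.5%
    printf("total efficiency:  %.1f%%\n", 100.0 * total_1024  / peak); // ~58.7%
    return 0;
}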

Source Code

If you are feeling particularly magnanimous, please test my code on your machine. Compiled with gcc -std=c99 -O2 -m32 -msse3 -mincoming-stack-boundary=2 -masm=intel somatmul2.c -o somatmul2.exe. Feel free to try other flags, but I have found these to work best on my machine.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>
#include <math.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>

#ifndef min
    #define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) rpack(        double* restrict dst, 
          const double* restrict src, 
            const int kc, const int mc, const int mr, const int n)
{
    double tmp[mc*kc] __attribute__ ((aligned(64)));
    double* restrict ptr = &tmp[0];

    for (int i = 0; i < mc; ++i)
        for (int j = 0; j < kc; j+=4) {
            *ptr++ = *(src + i*n + j  );
            *ptr++ = *(src + i*n + j+1);
            *ptr++ = *(src + i*n + j+2);
            *ptr++ = *(src + i*n + j+3);
        }
    ptr = &tmp[0];

    //const int inc_dst = mr*kc;
    for (int k = 0; k < mc; k+=mr)
        for (int j = 0; j < kc; ++j)
            for (int i = 0; i < mr*kc; i+=kc)
                *dst++ = *(ptr + k*kc + j + i);

}

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64)))  cpack(double* restrict dst, 
                const double* restrict src, 
                const int nc, 
                const int kc, 
                const int nr, 
                const int n)
{ //nc cols, kc rows, nr size ofregister strips
    double tmp[kc*nc] __attribute__ ((aligned(64)));
    double* restrict ptr = &tmp[0];

    for (int i = 0; i < kc; ++i)
        for (int j = 0; j < nc; j+=4) {
            *ptr++ = *(src + i*n + j  );
            *ptr++ = *(src + i*n + j+1);
            *ptr++ = *(src + i*n + j+2);
            *ptr++ = *(src + i*n + j+3);
        }

    ptr = &tmp[0];

    // const int inc_k = nc/nr;
    for (int k = 0; k < nc; k+=nr)
        for (int j = 0; j < kc*nc; j+=nc)
            for (int i = 0; i < nr; ++i)
                *dst++ = *(ptr + k + i + j);

}

#define KERNEL0(add0,add1,add2,add3)                                \
            "mulpd      xmm4, xmm6                      \n\t" \
            "addpd      xmm0, xmm4                      \n\t" \
            "movapd     xmm4, XMMWORD PTR [edx+"#add2"] \n\t" \
            "mulpd      xmm7, xmm4                      \n\t" \
            "addpd      xmm1, xmm7                      \n\t" \
            "movddup        xmm5, QWORD PTR [eax+"#add0"]   \n\t" \
            "mulpd      xmm6, xmm5                      \n\t" \
            "addpd      xmm2, xmm6                      \n\t" \
            "movddup        xmm7, QWORD PTR [eax+"#add1"]   \n\t" \
            "mulpd      xmm4, xmm5                      \n\t" \
            "movapd     xmm6, XMMWORD PTR [edx+"#add3"] \n\t" \
            "addpd      xmm3, xmm4                      \n\t" \
            "movapd     xmm4, xmm7                      \n\t" \
            "                                           \n\t"

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) dgemm_2x4_asm_j
(   
        const int mc, const int nc, const int kc,
        const double* restrict locA, const int cs_a, // mr
        const double* restrict locB, const int rs_b, // nr
              double* restrict C, const int rs_c
    )
{

    const double* restrict a1 = locA;
    for (int i = 0; i < mc ; i+=cs_a) {
        const double* restrict b1 = locB;
        double* restrict c11 = C + i*rs_c;
        for (int j = 0; j < nc ; j+=rs_b) {

            __asm__ __volatile__
        (
            "mov        eax, %[a1]                  \n\t"
            "mov        edx, %[b1]                  \n\t"
            "mov        edi, %[c11]                 \n\t"
            "mov        ecx, %[kc]                  \n\t"
            "pxor       xmm0, xmm0                  \n\t"
            "movddup    xmm7, QWORD PTR [eax]       \n\t" // a1
            "pxor       xmm1, xmm1                  \n\t"
            "movapd     xmm6, XMMWORD PTR [edx]     \n\t" // b1
            "pxor       xmm2, xmm2                  \n\t"
            "movapd     xmm4, xmm7                  \n\t" // a1
            "pxor       xmm3, xmm3                  \n\t"
            "sar        ecx, 3                      \n\t" // divide by 2^num
            "L%=:                                   \n\t" // start loop
            KERNEL0(    8,      16,     16,     32)
            KERNEL0(    24,     32,     48,     64)
            KERNEL0(    40,     48,     80,     96)
            KERNEL0(    56,     64,     112,    128)
            KERNEL0(    72,     80,     144,    160)
            KERNEL0(    88,     96,     176,    192)
            KERNEL0(    104,    112,    208,    224)
            KERNEL0(    120,    128,    240,    256)
            "add        eax, 128                    \n\t"
            "add        edx, 256                    \n\t"
            "                                       \n\t"
            "dec        ecx                         \n\t"
            "jne        L%=                         \n\t" // end loop
            "                                       \n\t"
            "mov        esi, %[rs_c]                \n\t" // don't need cs_a anymore
            "sal        esi, 3                      \n\t" // times 8
            "lea        ebx, [edi+esi]              \n\t" // don't need b1 anymore
            "addpd      xmm0, XMMWORD PTR [edi]     \n\t" // c11
            "addpd      xmm1, XMMWORD PTR [edi+16]  \n\t" // c11 + 2
            "addpd      xmm2, XMMWORD PTR [ebx]     \n\t" // c11
            "addpd      xmm3, XMMWORD PTR [ebx+16]  \n\t" // c11 + 2
            "movapd     XMMWORD PTR [edi], xmm0     \n\t"
            "movapd     XMMWORD PTR [edi+16], xmm1  \n\t"
            "movapd     XMMWORD PTR [ebx], xmm2     \n\t"
            "movapd     XMMWORD PTR [ebx+16], xmm3  \n\t"
            : // no outputs 
            : // inputs
            [kc]    "m" (kc),
            [a1]    "m" (a1), 
            [cs_a]  "m" (cs_a),
            [b1]    "m" (b1),
            [rs_b]  "m" (rs_b),
            [c11]   "m" (c11),
            [rs_c]  "m" (rs_c)
            : //register clobber
            "memory",
            "eax","ebx","ecx","edx","esi","edi",
            "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"
            );
        b1 += rs_b*kc;
        c11 += rs_b;
        }
    a1 += cs_a*kc;
    }
}


double blis_dgemm_ref(
        const int n,
        const double* restrict A,
        const double* restrict B,
        double* restrict C,
        const int mc,
        const int nc,
        const int kc
    )
{
    int mr = 2;
    int nr = 4;
    double locA[mc*kc] __attribute__ ((aligned(64)));
    double locB[kc*nc] __attribute__ ((aligned(64)));

    LARGE_INTEGER frequency, time1, time2;
    double time3 = 0.0;
    QueryPerformanceFrequency(&frequency);

    // zero C
    memset(C, 0, n*n*sizeof(double));

    int ii,jj,kk;
    //#pragma omp parallel num_threads(2) shared(A,B,C) private(ii,jj,kk,locA,locB)
    {//use all threads in parallel
        //#pragma omp for
        for ( jj = 0; jj < n; jj+=nc) { 
            for ( kk = 0; kk < n; kk+=kc) {
                cpack(locB, B + kk*n + jj, nc, kc, nr, n);
                for ( ii = 0; ii < n; ii+=mc) {
                    rpack(locA, A + ii*n + kk, kc, mc, mr, n);
                            QueryPerformanceCounter(&time1);
                            dgemm_2x4_asm_j( mc, nc, kc,
                                       locA         ,  mr,
                                       locB         ,  nr,
                                       C + ii*n + jj,  n );
                            QueryPerformanceCounter(&time2);
                            time3 += (double) (time2.QuadPart - time1.QuadPart);
                }
            }
        }
    }
    return time3 / frequency.QuadPart;
}

double compute_gflops(const double time, const int n)
{
    // computes the gigaflops for a square matrix-matrix multiplication
    double gflops;
    gflops = (double) (2.0*n*n*n)/time/1.0e9;
    return(gflops);
}

void main() {
    LARGE_INTEGER frequency, time1, time2;
    double time3, best_time;
    double kernel_time, best_kernel_time;
    QueryPerformanceFrequency(&frequency);
    int best_flag;
    double gflops, kernel_gflops;

    const int trials = 100;
    int nmax = 4096;
    printf("%16s %16s %16s\n","N","Total Gflops/s","Kernel Gflops/s");

    int mc = 256;
    int kc = 256;
    int nc = 128;
    for (int n = kc; n <= nmax; n+=kc) {
        double *A = NULL;   
        double *B = NULL;
        double *C = NULL;

        A = _mm_malloc (n*n * sizeof(*A),64); if (!A) {printf("A failed\n"); return;}
        B = _mm_malloc (n*n * sizeof(*B),64); if (!B) {printf("B failed\n"); return;}
        C = _mm_malloc (n*n * sizeof(*C),64); if (!C) {printf("C failed\n"); return;}

        srand(time(NULL));

        // Create the matrices
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                *(A + i*n + j) = (double) rand()/RAND_MAX;
                *(B + i*n + j) = (double) rand()/RAND_MAX;
            }
        }

            // warmup
            blis_dgemm_ref(n,A,B,C,mc,nc,kc);

            for (int count = 0; count < trials; count++){
                    QueryPerformanceCounter(&time1);
                    kernel_time = blis_dgemm_ref(n,A,B,C,mc,nc,kc);
                    QueryPerformanceCounter(&time2);
                    time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
                if (count == 0) {
                    best_time = time3;
                    best_kernel_time = kernel_time;
                }
                else {
                    best_flag = ( time3 < best_time ? 1 : 0 );
                    if (best_flag) {
                        best_time = time3;
                        best_kernel_time = kernel_time;
                    }
                }
            }
            gflops = compute_gflops(best_time, n);
            kernel_gflops = compute_gflops(best_kernel_time, n);
            printf("%16d %16f %16f\n",n,gflops,kernel_gflops);

        _mm_free(A);
        _mm_free(B);
        _mm_free(C);
    }
    printf("tests are done\n");

}

EDIT

I replaced the packing functions with the following new, faster versions:

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64))) rpack(        double* restrict dst, 
          const double* restrict src, 
            const int kc, const int mc, const int mr, const int n)
{
    for (int i = 0; i < mc/mr; ++i)
        for (int j = 0; j < kc; ++j)
            for (int k = 0; k < mr; ++k)
                *dst++ = *(src + i*n*mr + k*n + j);
}

inline void 
__attribute__ ((gnu_inline))        
__attribute__ ((aligned(64)))  cpack(double* restrict dst, 
                const double* restrict src, 
                const int nc, 
                const int kc, 
                const int nr, 
                const int n)
{   
    for (int i = 0; i < nc/nr; ++i)
        for (int j = 0; j < kc; ++j)
            for (int k = 0; k < nr; ++k)
                *dst++ = *(src + i*nr + j*n + k);
}
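
To make the packed layout concrete, here is a tiny standalone driver for the new rpack loop (the 4x4 matrix and the mc = kc = 4, mr = 2 values are made up for illustration, and restrict plus the attributes are dropped so it compiles as plain C99):

#include <stdio.h>

// Same loop structure as the new rpack above, minus restrict/attributes.
static void rpack_demo(double *dst, const double *src,
                       int kc, int mc, int mr, int n)
{
    for (int i = 0; i < mc/mr; ++i)
        for (int j = 0; j < kc; ++j)
            for (int k = 0; k < mr; ++k)
                *dst++ = *(src + i*n*mr + k*n + j);
}

int main(void)
{
    // 4x4 row-major matrix whose entries encode (row, col) as r.c
    double A[16], packed[16];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            A[r*4 + c] = r + c/10.0;

    rpack_demo(packed, A, 4, 4, 2, 4); // kc = 4, mc = 4, mr = 2, n = 4

    // Prints: 0.0 1.0 0.1 1.1 0.2 1.2 0.3 1.3 2.0 3.0 2.1 3.1 2.2 3.2 2.3 3.3
    // i.e. column by column within each 2-row panel, which is exactly the
    // order the 2x4 kernel walks A (two movddup loads per k).
    for (int i = 0; i < 16; ++i)
        printf("%.1f ", packed[i]);
    printf("\n");
    return 0;
}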

Results With New Packing Functions

Nice boost to the overall performance:

       N   Total Gflops/s  Kernel Gflops/s
     256         7.915617         8.794849
     512         8.466467         9.350920
     768         8.354890         9.135575
    1024         8.168944         8.884611
    1280         8.174249         8.825920
    1536         8.285458         8.938712
    1792         7.988038         8.581001

LINPACK 32-bit with 1 thread

CPU frequency:    2.792 GHz
Number of CPUs: 1
Number of cores: 2
Number of threads: 1

Performance Summary (GFlops)
Size   LDA    Align.  Average  Maximal
128    128    4       4.7488   5.0094  
256    256    4       6.0747   6.9652  
384    384    4       6.5208   7.2767  
512    512    4       6.8329   7.5706  
640    640    4       7.4278   7.8835  
768    768    4       7.7622   8.0677  
896    896    4       7.8860   8.4737  
1024   1024   4       7.7292   8.1076  
1152   1152   4       8.0411   8.4738  
1280   1280   4       8.1429   8.4863  
1408   1408   4       8.2284   8.7073  
1536   1536   4       8.3753   8.6437  
1664   1664   4       8.6993   8.9108  
1792   1792   4       8.7576   8.9176  
1920   1920   4       8.7945   9.0678  
2048   2048   4       8.5490   8.8827  
2176   2176   4       9.0138   9.1161  
2304   2304   4       8.1402   9.1446  
2432   2432   4       9.0003   9.2082  
2560   2560   4       8.8560   9.1197  
2688   2688   4       9.1008   9.3144  
2816   2816   4       9.0876   9.3089  
2944   2944   4       9.0771   9.4191  
3072   3072   4       8.9402   9.2920  
3200   3200   4       9.2259   9.3699  
3328   3328   4       9.1224   9.3821  
3456   3456   4       9.1354   9.4082  
3584   3584   4       9.0489   9.3351  
3712   3712   4       9.3093   9.5108  
3840   3840   4       9.3307   9.5324  
3968   3968   4       9.3895   9.5352  
4096   4096   4       9.3269   9.3872  

EDIT 2

Here are some results from single-threaded OpenBLAS, which took 4 hours to compile last night. As you can see, it is getting close to 95% CPU usage. Max single-threaded performance with both cores on is 11.2 Gflops (2-step Intel Turbo Boost). I need to turn off the other core to get 12.26 Gflops (4-step Intel Turbo Boost). Assuming that the packing functions in OpenBLAS generate no additional overhead, the OpenBLAS kernel must be running at least as fast as the total OpenBLAS code. So I need to get my kernel running at that speed. I have yet to figure out how to make my assembly faster. I will be focusing on this over the next few days.

I ran the tests below from the Windows command line with: start /realtime /affinity 1

My code:

       N   Total Gflops/s  Kernel Gflops/s
     256         7.927740         8.832366
     512         8.427591         9.347094
     768         8.547722         9.352993
    1024         8.597336         9.351794
    1280         8.588663         9.296724
    1536         8.589808         9.271710
    1792         8.634201         9.306406
    2048         8.527889         9.235653

OpenBLAS:

       N   Total Gflops/s
     256        10.599065
     512        10.622686
     768        10.745133
    1024        10.762757
    1280        10.832540
    1536        10.793132
    1792        10.848356
    2048        10.819986

Solution

It's theoretically possible to look at that code and reason through whether it could be arranged to make better use of microarchitectural resources, but even the performance architects at Intel might not recommend doing it that way. It might help to use a tool like VTune or Intel Performance Counter Monitor (https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization) to find out how much of your workload is memory bound versus front-end bound versus back-end bound. Intel Architecture Code Analyzer might also be a quick source of help in narrowing down which of the potential issues listed below to follow up on first.

Nominal Animal is probably on the right track in the comments talking about interleaving instructions that access memory and those that do computation. A few other possibilities:

  • Using other instructions for some of the computation might reduce pressure on one of the execution ports (see section 3.3.4 of this presentation: http://sc.tamu.edu/systems/eos/nehalem.pdf). In particular, mulpd is always going to dispatch to port 1 on Westmere. Maybe if there are any cycles where port 0 isn't getting used, you could sneak in a scalar FP multiply there.
  • One or another of the hardware prefetchers could be saturating the bus early or polluting the cache with lines you don't end up using (a software-prefetch experiment is sketched after this list).
  • On the other hand, there's a slim possibility that the ordering of memory references or the memory layout implied in dgemm_2x4_asm_j is faking out the prefetchers.
  • Changing the relative ordering of pairs of instructions that don't have any data dependencies might lead to better use of front-end or back-end resources.
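
If you want to poke at the prefetcher theories directly, one cheap experiment is to issue explicit software prefetches for the packed panels a fixed distance ahead of where the micro-kernel is reading and see whether it helps or hurts. A minimal sketch (the function name and PF_DIST are placeholders, and the 2*k / 4*k strides assume the mr = 2 / nr = 4 packed layouts from the question):

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Hypothetical prefetch distance in k-iterations; tune it empirically.
// Too small does nothing, too large can evict data that is still live.
#define PF_DIST 8

static void kernel_loop_with_prefetch(const double *Ap, const double *Bp, int kc)
{
    for (int k = 0; k < kc; ++k) {
        // Pull the packed A and B data PF_DIST iterations ahead toward L1.
        // Prefetches past the end of the buffers are harmless (they don't fault).
        _mm_prefetch((const char *)(Ap + 2*(k + PF_DIST)), _MM_HINT_T0);
        _mm_prefetch((const char *)(Bp + 4*(k + PF_DIST)), _MM_HINT_T0);

        /* ... the 2x4 multiply-accumulate for iteration k goes here ... */
    }
}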
