Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

Problem Description

I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors

float triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    for(int i=0; i<n; i++) {
        z[i] = x[i] + k*y[i];
    }
}

This is the triad function from STREAM.

I get about 95% of the peak with SandyBridge/IvyBridge processors with this function (using assembly with NASM). However, using Haswell I only achieve 62% of the peak unless I unroll the loop. If I unroll 16 times I get 92%. I don't understand this.

I decided to write my function in assembly using NASM. The main loop in assembly looks like this.

.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2

It turns out that in examples 12.7-12.11 of Agner Fog's Optimizing Assembly manual he does almost the same thing (but for y[i] = y[i] + k*x[i]) for the Pentium M, Core 2, Sandy Bridge, FMA4, and FMA3. I managed to reproduce his code more or less on my own (actually he has a small bug in the FMA3 example when he broadcasts). He gives instruction size counts, fused ops, and execution ports in tables for each processor except for FMA4 and FMA3. I have tried to make this table myself for FMA3.

                                 ports
             size   μops-fused   0   1   2   3   4   5   6   7    
vmovaps      5      1                    ½   ½
vfmadd231ps  6      1            ½   ½   ½   ½
vmovaps      5      1                            1           1
add          4      ½                                    ½
jne          2      ½                                    ½
--------------------------------------------------------------
total       22      4            ½   ½   1   1   1   0   1   1

Size refers to the instruction length in bytes. The reason the add and jne instructions have half a μop is that they get fused into one macro-op (not to be confused with μop fusion, which still uses multiple ports) and so need only port 6 and one μop. The vfmadd231ps instruction can use port 0 or port 1; I chose port 0. The load vmovaps can use port 2 or 3; I chose 2 and had vfmadd231ps use port 3. In order to be consistent with Agner Fog's tables, and since I think it makes more sense to say that an instruction which can go to different ports goes to each one half of the time, I assigned 1/2 to each of the ports that vmovaps and vfmadd231ps can go to.

Based on this table and the fact that all processors since Core2 can issue four μops every clock cycle, it appears this loop should be able to run in one clock cycle per iteration, but I have not managed to obtain that. Can somebody please explain to me why I can't get close to the peak bandwidth for this function on Haswell without unrolling? Is this possible without unrolling, and if so how can it be done? Let me be clear that I'm really trying to maximize the ILP for this function (I don't only want maximum bandwidth), so that's the reason I don't want to unroll.
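
To spell out the arithmetic behind these percentages (a summary using only numbers already in the question and the test harness below): at one iteration per cycle, the AVX/FMA loop moves two 32-byte loads and one 32-byte store, i.e. 96 bytes per cycle, so

peak = 96 bytes/cycle × 1.3 GHz ≈ 124.8 GB/s      (i5-4250U, turbo disabled)
62% of peak ≈ one iteration per 1.6 cycles instead of the hoped-for 1.0

so every stall cycle in the loop shows up directly as lost bandwidth.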

Edit: Here is an update, since Iwillnotexist Idonotexist showed using IACA that the stores never use port 7. I managed to break the 66% barrier without unrolling, and to do this in one clock cycle per iteration (theoretically). Let's first address the store problem.

Stephen Canon mentioned in a comment that the Address Generation Unit (AGU) in port 7 can only handle simple operations such as [base + offset] and not [base + index]. In the Intel optimization reference manual the only thing I found was a comment on port 7 which says "Simple_AGU", with no definition of what simple means. But then Iwillnotexist Idonotexist found in the comments of IACA that this problem was already mentioned six months ago, when an employee at Intel wrote on 03/11/2014:

Port7 AGU can only work on stores with simple memory address (no index register).

Stephen Canon suggested "using the store address as the offset for the load operands." I tried this, like so:

vmovaps         ymm1, [rdi + r9 + 32*i]
vfmadd231ps     ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps         [r9 + 32*i], ymm1
add             r9, 32*unroll
cmp             r9, rcx
jne             .L2

This indeed causes the store to use port 7. However, it has another problem: the vfmadd231ps does not fuse with the load, which you can see from IACA. It also needs a cmp instruction, which my original function did not. So the store uses one less micro-op, but the cmp (or rather the add, since the cmp macro-fuses with the jne) needs one more. IACA reports a block throughput of 1.5. In practice this only gets about 57% of the peak.

But I found a way to get the vfmadd231ps instruction to fuse with the load as well. This can only be done using static arrays with [absolute 32-bit address + index] addressing, like this. Evgeny Kluev originally suggested this:

vmovaps         ymm1, [src1_end + rax]
vfmadd231ps     ymm1, ymm2, [src2_end + rax]
vmovaps         [dst_end + rax], ymm1
add             rax, 32
jl              .L2

Where src1_end, src2_end, and dst_end are the end addresses of static arrays.
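
For reference, here is a minimal sketch of how such a static-array variant might be laid out. This is my illustration rather than the benchmark code; it assumes the length is fixed at assembly time (N below) and that the arrays are linked in the low 2 GiB (the default for a non-PIE binary) so that [label + rax] is encodable as a 32-bit absolute address plus index:

%define N 2048                  ; array length, assumed fixed at assembly time

section .data
pi:   dd 3.14159

section .bss
alignb 64
src1: resd N
src1_end:                       ; labels mark the ENDS of the arrays
src2: resd N
src2_end:
dst:  resd N
dst_end:

section .text
global triad_static
triad_static:
    vbroadcastss ymm2, [rel pi]
    mov          rax, -N*4      ; negative byte index counts up toward zero
.L2:
    vmovaps      ymm1, [src1_end + rax]
    vfmadd231ps  ymm1, ymm2, [src2_end + rax]
    vmovaps      [dst_end + rax], ymm1
    add          rax, 32
    jl           .L2
    vzeroupper
    ret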

This reproduces the table in my question with the four fused micro-ops that I expected. If you put this into IACA it reports a block throughput of 1.0. In theory this should do as well as the SSE and AVX versions. In practice it gets about 72% of the peak. That breaks the 66% barrier, but it's still a long way from the 92% I get unrolling 16 times. So on Haswell the only option to get close to the peak is to unroll. This is not necessary on Core2 through Ivy Bridge, but it is on Haswell.

End_edit:

Here is the C/C++ Linux code to test this. The NASM code is posted after the C/C++ code. The only thing you have to change is the frequency number: in the line double frequency = 1.3; replace 1.3 with whatever the operating (not nominal) frequency of your processor is (which in the case of my i5-4250U with turbo disabled in the BIOS is 1.3 GHz).

Compile with

nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma  tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx  tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse

The C/C++ code

#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);    
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat); 
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);

#if (defined(__FMA__))
float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
float triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
float triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3;  //Haswell
    //double frequency = 3.6;  //IB
    //double frequency = 2.66;  //Core2
    #if (defined(__FMA__))
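    // Haswell with FMA: two 32 B loads + one 32 B store per cycle = 96 bytes/cycle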
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n = 2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;
    #if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
    #elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
    #else
    triad_sse_repeat(x,y,z2,n,repeat);
    #endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
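        // 3 streams (x, y, z) × sizeof(float) bytes × n elements moved per pass; 1E-9 converts to GB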
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
        #else
        triad_sse_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic   rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16    rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}

The NASM code using the System V AMD64 ABI.

triad_fma_asm.asm:

global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159
;align 16
section .text
    triad_fma_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
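    ; rax = -4*n here: the pointers were advanced to the array ends above,
    ; so [rdi+rax] starts at x[0] and the fused add/jne below terminates
    ; the inner loop when rax counts up to zero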
align 16
.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_asm_repeat_unroll16
section .text
    triad_fma_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 32
    %assign i 0 
    %rep    unroll
        vmovaps         ymm1, [r9 + 32*i]
        vfmadd231ps     ymm1, ymm2, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_avx_asm.asm:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat2:
    shl             rcx, 2  
    vbroadcastss    ymm2, [rel pi]

align 16
.L1:
    xor             rax, rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             eax, 32
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat_unroll16
align 16
section .text
    triad_avx_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
align 16
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
align 16
.L2:
    %assign unroll 16
    %assign i 0 
    %rep    unroll
        vmulps          ymm1, ymm2, [r9 + 32*i]
        vaddps          ymm1, ymm1, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_sse_asm.asm:

global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
    ;neg                rcx 
align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             rax, 16
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat2:
    shl             rcx, 2  
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
align 16
.L1:
    xor             rax, rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             eax, 16
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret



global triad_sse_asm_repeat_unroll16
section .text
    triad_sse_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 8
    %assign i 0 
    %rep    unroll
        movaps          xmm1, [r9 + 16*i]
        mulps           xmm1, xmm2
        addps           xmm1, [r10 + 16*i]
        movaps          [r11 + 16*i], xmm1
    %assign i i+1 
    %endrep
    add             r9, 16*unroll
    add             r10, 16*unroll
    add             r11, 16*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

Solution

IACA Analysis

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.

IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

In particular, the reported block throughput in cycles (1.5) jibes very well with an efficiency of 66%: one iteration per 1.5 cycles is 2/3 of the ideal one iteration per cycle.

A post on IACA's own website about this very phenomenon on Tue, 03/11/2014 - 12:39 was met by this reply by an Intel employee on Tue, 03/11/2014 - 23:20:

Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.

This firmly settles why Port 7 wasn't being used.
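
To make the constraint concrete, here are the two store forms side by side (my illustration, following the Intel comment above):

vmovaps [rdx + 0x20], ymm1      ; base + displacement: a "simple" address, eligible for the port 7 AGU
vmovaps [rdx + rax], ymm1       ; base + index: must use the AGUs on ports 2/3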

Now, contrast the above with a 32x unrolled loop (it turns out unroll16 should actually be called unroll32):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

We see here micro-fusion and correct scheduling of the store to Port 7.
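
For reference, the loop skeleton behind this trace looks like the following (a sketch I've reconstructed from the trace itself; the label and alignment directive are mine). Because each array gets its own pointer advanced by a plain add, every store address has the form [r11 + disp] with no index register, which is exactly what Port 7's simple AGU requires:

align 32
.L2:
    vmovaps     ymm1, [r9]          ; load from x           (port 2 or 3)
    vfmadd231ps ymm1, ymm2, [r10]   ; load from y, micro-fused with the FMA
    vmovaps     [r11], ymm1         ; store to z: base + displacement only, Port 7 eligible
    ; ... 31 more copies with displacements 0x20, 0x40, ..., 0x3e0 ...
    add         r9,  0x400          ; advance each pointer past the 32 vectors consumed
    add         r10, 0x400
    add         r11, 0x400
    cmp         r9,  rcx            ; rcx holds the end address
    jnz         .L2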

Manual Analysis (see edit above)

I can now answer the second of your questions: "Is this possible without unrolling and if so how can it be done?" The answer is no.

I padded the arrays x, y and z on the left and right with plenty of buffer for the experiment below, and changed the inner loop to the following:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP

This loop intentionally uses no FMA (only loads and stores), and none of the load/store instructions depend on one another, so there should be no hazards preventing their issue to any execution port.
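
For reference, here is how these five instructions map onto Haswell's ports under the rules already cited in this question (my own annotation, not IACA output; the store's address uses an index register, so Port 7 is not eligible for it):

vmovaps ymm1, [rdi+rax]  ; 1L : load, port 2 or 3
vmovaps ymm0, [rsi+rax]  ; 2L : load, port 2 or 3
vmovaps [rdx+rax], ymm2  ; S1 : store-data on port 4, store-address on port 2 or 3
                         ;      (not port 7: indexed address)
add rax, 32              ; A  : macro-fuses with the jne when adjacent
jne .L2                  ; J  : predicted-taken branch, port 6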

I then tested every permutation of the first and second loads (1L and 2L), the store (S1) and the add (A), leaving the conditional jump (J) at the end, and for each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that moving the add rax, 32 ahead of one of the reg+reg accesses would cause that load or store to target the wrong address; an example of this correction is sketched after the table). The loop was aligned to 32 bytes. The tests were run on a 2.4GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results, out of a maximum of 4! = 24 permutations (mirror permutations that merely swap the two loads are marked "Hypothetically ="):

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS12J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%
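
To make the offset correction concrete, here is permutation 12ASJ under Case 1 (S-32) as a sketch (in the tests the -32 was applied to the base pointer of z rather than written as a displacement, but the two are equivalent):

.L2:
vmovaps ymm1, [rdi+rax]      ; 1L
vmovaps ymm0, [rsi+rax]      ; 2L
add     rax, 32              ; A, now executes before the store
vmovaps [rdx+rax-32], ymm2   ; S1: -32 undoes the early add, so the target address is unchanged
jne     .L2                  ; J: still tests the flags set by the add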

We can notice a few things from the table:

  • The results fall into several plateaux, but only two main ones: just under 50% and around 65%.
  • L1 and L2 can permute freely between each other without affecting the result.
  • Offsetting the accesses by -32 bytes can change efficiency.
  • The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:
    • 12SAJ Case 0 (No offsets applied), with efficiency 65.34% (the highest)
    • 12ASJ Case 1 (S-32), with efficiency 64.48%
    • 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06%
    • A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
  • There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where S-32) seems to guarantee this.
  • Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
  • For cases 0, 3, 5 and 7, it depends on the permutation.

Whence we may draw at least a few conclusions:

  • Execution ports 2 and 3 really don't care which load address they generate and load from.
  • Macro-op fusion of the add and the jne appears unimpacted by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: the distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.
    • What Evgeny was seeing (going from 12SAJ with efficiency 65% to the others with efficiency 49% within Case 0) was an effect due solely to the values of the addresses being loaded from and stored to, and not to any inability of the core to macro-fuse the add and branch.
    • Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC (65.34% efficiency corresponds to 1/0.6534 ≈ 1.53 cycles per iteration). If macro-op fusion did not occur, this would be 2 CC minimum.
  • Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers the question of whether the full bandwidth can be used without unrolling with an empirical "no".

I will hypothesize several possible explanations:

  • We're seeing some weird perversion due to the value of the addresses relative to each other.
    • If so, then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
  • We're seeing the loop run in one-two-step mode: the loop iterations alternate between running in one clock cycle and in two.

    • This could be macro-op fusion being affected by the decoders. From Agner Fog:

      Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell.

    • Alternatively, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would be self-correcting in the next clock cycle but would remain oscillatory.
      • If somebody has access to the Intel performance counters, he should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occurring, all ports in use will be pegged equally during the relevant stretch of time; otherwise, if oscillation is occurring, there will be a 50% split. It is especially important to look at the ports Mysticial pointed out (0, 1, 6 and 7).

And here's what I think is not happening:

  • I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and always predict as taken.

I believe this is a problem that will be solved with Intel's performance counters.
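
In the meantime, on Linux those counters can be read with perf, assuming your perf build exposes the Haswell per-port event aliases (otherwise the raw event is 0xA1 with one umask bit per port, 0x01 for Port 0 up to 0x80 for Port 7); for example, against the test binary built above:

perf stat -e uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_6,uops_executed_port.port_7 ./tests_fma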
