Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%


Problem description


I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    for(int i=0; i<n; i++) {
        z[i] = x[i] + k*y[i];
    }
}

This is the triad function from STREAM.

I get about 95% of the peak on SandyBridge/IvyBridge processors with this function (using assembly with NASM). However, on Haswell I only achieve 62% of the peak unless I unroll the loop. If I unroll 16 times I get 92%. I don't understand this.

I decided to write my function in assembly using NASM. The main loop in assembly looks like this.

.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2

It turns out in Agner Fog's Optimizing Assembly manual, in examples 12.7-12.11, he does almost the same thing (but for y[i] = y[i] + k*x[i]) for the Pentium M, Core 2, Sandy Bridge, FMA4, and FMA3. I managed to reproduce his code more or less on my own (actually he has a small bug in the FMA3 example when he broadcasts). He gives instruction size counts, fused ops, and execution ports in tables for each processor except for FMA4 and FMA3. I have tried to make this table myself for FMA3.

                                 ports
             size   μops-fused   0   1   2   3   4   5   6   7    
vmovaps      5      1                    ½   ½
vfmadd231ps  6      1            ½   ½   ½   ½
vmovaps      5      1                            1           1
add          4      ½                                    ½
jne          2      ½                                    ½
--------------------------------------------------------------
total       22      4            ½   ½   1   1   1   0   1   1

Size refers to the instruction length in bytes. The reason the add and jne instructions have half a μop is that they get fused into one macro-op (not to be confused with μop fusion, which still uses multiple ports) and only need port 6 and one μop. The vfmadd231ps instruction can use port 0 or port 1. I chose port 0. The load vmovaps can use port 2 or 3. I chose 2 and had vfmadd231ps use port 3. In order to be consistent with Agner Fog's tables, and since I think it makes more sense to say that an instruction which can go to different ports goes to each one half of the time, I assigned 1/2 to each of the ports vmovaps and vfmadd231ps can go to.

Based on this table, and the fact that all Intel processors since Core2 can issue four μops every clock cycle, it appears this loop should be able to run at one iteration per clock cycle, but I have not managed to obtain that. Can somebody please explain why I can't get close to the peak bandwidth for this function on Haswell without unrolling? Is this possible without unrolling, and if so, how can it be done? Let me be clear: I'm really trying to maximize the ILP for this function (I don't only want maximum bandwidth), so that's the reason I don't want to unroll.
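
Spelling out the budget behind that claim, the table above amounts to the following per-iteration count (assuming the micro- and macro-fusion it shows actually happens):

vmovaps (load)            1 fused μop   (port 2 or 3)
vfmadd231ps + load        1 fused μop   (FMA on port 0 or 1, load on port 2 or 3)
vmovaps (store)           1 fused μop   (data on port 4, address on port 7)
add/jne (macro-fused)     1 fused μop   (port 6)
----------------------------------------------------
total                     4 fused μops per iteration

Four fused μops per iteration at an issue width of four per clock suggests one iteration per clock; whether the fusion and the port assignments actually hold up on Haswell is what the rest of this question and the answer below examine with IACA.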

Edit: Here is an update, since Iwillnotexist Idonotexist showed using IACA that the stores never use port 7. I managed to break the 66% barrier without unrolling and to do this (theoretically) in one clock cycle every iteration. Let's first address the store problem.

Stephen Canon mentioned in a comment that the Address Generation Unit (AGU) on port 7 can only handle simple operations such as [base + offset] and not [base + index]. In the Intel optimization reference manual the only thing I found was a comment on port 7 which says "Simple_AGU", with no definition of what simple means. But then Iwillnotexist Idonotexist found in the comments on IACA that this problem was already mentioned six months ago, where an employee at Intel wrote on 03/11/2014:

Port7 AGU can only work on stores with simple memory address (no index register).
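
In other words, the store in the loop above, vmovaps [rdx+rax], uses an indexed address, which is exactly the form port 7 cannot handle; a store whose address is only a base plus a constant displacement (for example [rdx+32]) would qualify:

vmovaps         [rdx+rax], ymm1     ; base + index: store address must be generated on port 2 or 3
vmovaps         [rdx+32], ymm1      ; base + displacement only: eligible for the port 7 AGU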

Stephen Canon suggests "using the store address as the offset for the load operands." I tried it like this:

vmovaps         ymm1, [rdi + r9 + 32*i]
vfmadd231ps     ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps         [r9 + 32*i], ymm1
add             r9, 32*unroll
cmp             r9, rcx
jne             .L2

This indeed causes the store to use port 7. However, it has another problem, which is that the vfmadd231ps does not fuse with the load, as you can see from IACA. It also needs the additional cmp instruction, which my original function did not. So the store uses one less micro-op, but the cmp (or rather the add, since the cmp macro-fuses with the jne) needs one more. IACA reports a block throughput of 1.5. In practice this only gets about 57% of the peak.
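
Counting fused-domain μops shows where that 1.5 comes from (my tally, assuming the fusion behaviour just described):

load 1 + unfused vfmadd231ps 2 + store 1 + add 1 + macro-fused cmp/jne 1 = 6 μops per iteration
6 μops / (4 μops issued per clock) = 1.5 clocks per iteration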

But I found a way to get the vfmadd231ps instruction to fuse with the load as well. This can only be done using static arrays with [absolute 32-bit address + index] addressing, like this. Evgeny Kluev originally suggested this.

vmovaps         ymm1, [src1_end + rax]
vfmadd231ps     ymm1, ymm2, [src2_end + rax]
vmovaps         [dst_end + rax], ymm1
add             rax, 32
jl              .L2

Where src1_end, src2_end, and dst_end are the end addresses of static arrays.

This reproduces the table in my question with the four fused micro-ops that I expected. If you put this into IACA it reports a block throughput of 1.0. In theory this should do as well as the SSE and AVX versions. In practice it gets about 72% of the peak. That breaks the 66% barrier but it's still a long way from the 92% I get unrolling 16 times. So on Haswell the only option to get close to the peak is to unroll. This is not necessary on Core2 through Ivy Bridge but it is on Haswell.
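
For reference, here is a minimal sketch of why unrolling helps (a 2x version of the same idea the unroll16 kernel below uses, with the same register roles): once the pointers are advanced once per block instead of once per triad, every store address is a plain [base + constant], so it is eligible for port 7, and the loop-overhead μops are amortized over several triads.

.L2:
    vmovaps         ymm1, [r9]
    vfmadd231ps     ymm1, ymm2, [r10]
    vmovaps         [r11], ymm1
    vmovaps         ymm1, [r9 + 32]
    vfmadd231ps     ymm1, ymm2, [r10 + 32]
    vmovaps         [r11 + 32], ymm1
    add             r9, 64
    add             r10, 64
    add             r11, 64
    cmp             r9, rcx
    jne             .L2

In practice it takes much deeper unrolling (the 16x/32x kernels below) to actually approach the peak.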

End_edit:

Here is the C/C++ Linux code to test this. The NASM code is posted after the C/C++ code. The only thing you have to change is the frequency number. In the line double frequency = 1.3; replace 1.3 with whatever the operating (not nominal) frequency of your processor is (which in the case of an i5-4250U with turbo disabled in the BIOS is 1.3 GHz).
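
The peak that the efficiency percentages are measured against is just frequency times bytes per cycle: with AVX+FMA, Haswell can in principle sustain two 32-byte loads plus one 32-byte store per cycle, i.e. 96 bytes/cycle (the bytes_per_cycle values in the code below). For example, at 1.3 GHz:

96 bytes/cycle * 1.3 GHz = 124.8 GB/s peak
62% of peak ≈ 77 GB/s,  92% of peak ≈ 115 GB/s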

Compile with

nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma  tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx  tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse

The C/C++ code

#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);    
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat); 
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);

#if (defined(__FMA__))
void triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
void triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
void triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3;  //Haswell
    //double frequency = 3.6;  //IB
    //double frequency = 2.66;  //Core2
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n =2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;
    #if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
    #elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
    #else
    triad_sse_repeat(x,y,z2,n,repeat);
    #endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1     rate %6.2f GB/s, efficency %6.2f%%, error %d
", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
        #else
        triad_sse_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic   rate %6.2f GB/s, efficency %6.2f%%, error %d
", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16    rate %6.2f GB/s, efficency %6.2f%%, error %d
", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}

The NASM code using the System V AMD64 ABI.

triad_fma_asm.asm:

global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = x[i] + 3.14159*y[i]
pi: dd 3.14159
;align 16
section .text
    triad_fma_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_asm_repeat_unroll16
section .text
    triad_fma_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 32
    %assign i 0 
    %rep    unroll
        vmovaps         ymm1, [r9 + 32*i]
        vfmadd231ps     ymm1, ymm2, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_avx_asm.asm:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat2:
    shl             rcx, 2  
    vbroadcastss    ymm2, [rel pi]

align 16
.L1:
    xor             rax, rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             eax, 32
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat_unroll16
align 16
section .text
    triad_avx_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
align 16
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
align 16
.L2:
    %assign unroll 16
    %assign i 0 
    %rep    unroll
        vmulps          ymm1, ymm2, [r9 + 32*i]
        vaddps          ymm1, ymm1, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_sse_asm.asm:

global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
    ;neg                rcx 
align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             rax, 16
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat2:
    shl             rcx, 2  
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
align 16
.L1:
    xor             rax, rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             eax, 16
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret



global triad_sse_asm_repeat_unroll16
section .text
    triad_sse_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 8
    %assign i 0 
    %rep    unroll
        movaps          xmm1, [r9 + 16*i]
        mulps           xmm1, xmm2
        addps           xmm1, [r10 + 16*i]
        movaps          [r11 + 16*i], xmm1
    %assign i i+1 
    %endrep
    add             r9, 16*unroll
    add             r10, 16*unroll
    add             r11, 16*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

Solution

IACA Analysis

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.

IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

In particular, the reported block throughput of 1.5 cycles agrees very well with an efficiency of 66%.
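
Concretely, that 1.5 falls straight out of the port pressure shown above: with the store unable to use port 7, every iteration needs three address generations (two loads plus the store address) on ports 2 and 3 alone.

3 AGU μops / 2 AGU ports (2 and 3) = 1.5 cycles per iteration
1 iteration per 1.5 cycles ≈ 67% of the one-iteration-per-cycle ideal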

A post on IACA's own website about this very phenomenon on Tue, 03/11/2014 - 12:39 was met by this reply by an Intel employee on Tue, 03/11/2014 - 23:20:

Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.

This firmly settles why Port 7 wasn't being used.

Now, contrast the above with a 32x unrolled loop (it turns out unroll16 should actually be called unroll32):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

We see here micro-fusion and correct scheduling of the store to Port 7.
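
Reading the numbers the same way: the 32-triad block needs 64 load μops on ports 2 and 3, 32 store-data μops on port 4 and 32 store-address μops, which now all go to port 7 (and the 32 FMAs split 16/16 over ports 0 and 1).

64 loads / 2 ports = 32 cycles,  32 store-data on port 4 = 32 cycles,  32 store-addresses on port 7 = 32 cycles

So the whole block fits in 32 cycles, i.e. one triad per cycle, which is the theoretical 96 bytes/cycle peak.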

Manual Analysis (see edit above)

I can now answer the second of your questions: Is this possible without unrolling and if so how can it be done? The answer is no.

I padded the arrays x, y and z to the left and right with plenty of buffer for the below experiment, and changed the inner loop to the following:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP

This intentionally does not use FMA (only loads and stores), and all load/store instructions have no dependencies, so there should be no hazards whatsoever preventing their issue to any execution port.

I then tested every single permutation of the first and second loads (1L and 2L), the store (S1) and the add (A) while leaving the conditional jump (J) at the end, and for each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that reordering the add rax, 32 before one of the r+r indexes would cause the load or store to target the wrong address). The loop was aligned to 32 bytes. The tests were run on a 2.4GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results (maximum of 24):

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS12J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%

We can notice a few things from the table:

  • Several plateaux of results, but two main ones only: Just under 50% and around 65%.
  • L1 and L2 can permute freely between each other without affecting the result.
  • Offsetting the accesses by -32 bytes can change efficiency.
  • The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:
    • 12SAJ Case 0 (No offsets applied), with efficiency 65.34% (the highest)
    • 12ASJ Case 1 (S-32), with efficiency 64.48%
    • 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06%
    • A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
  • There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where S-32) seems to guarantee this.
  • Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
  • For cases 0, 3, 5 and 7, it depends on the permutation.

Whence we may draw at least a few conclusions:

  • Execution ports 2 and 3 really don't care which load address they generate and load from.
  • Macro-op fusion of the add and jmp appears unimpacted by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: The distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.
    • What Evgeny was seeing (Going from 12SAJ with efficiency 65% to the others with efficiency 49% within Case 0) was an effect due solely to the value of the addresses loaded and stored from, and not due to an inability of the core to macro-fuse the add and branch.
    • Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC. If macro-op fusion did not occur this would be 2 CC minimum.
  • Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers empirically with a "no" the question of whether it is possible to use the full bandwidth without unrolling.

I will hypothesize several possible explanations:

  • We're seeing some weird perversion due to the value of the addresses relative to each other.
    • If so then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
  • We're seeing the loop run in one-two-step mode; the loop iterations alternate running in one clock cycle, then two.

    • This could be macro-op fusion being affected by the decoders. From Agner Fog:

      Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell.

    • Alternately, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would be self-correcting in the next clock cycle but would remain oscillatory.
      • If somebody has access to the Intel performance counters, he should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occurring, all ports that are used will be pegged equally during the relevant stretch of time; else, if oscillation is occurring, there will be a 50% split. Especially important is to look at the ports Mysticial pointed out (0, 1, 6 and 7).

And here's what I think is not happening:

  • I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and always predict as taken.

I believe this is a problem that will be solved with Intel's performance counters.
