Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake


Question


I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 byte and one that's always to the same location1, like this:

volatile uint32_t value;

void weirdo_cpp(size_t iters, uint32_t* output) {

    uint32_t x = value;
    uint32_t          *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        *rdx    = x;
        *rsi = x;

        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}

In assembly this loop probably3 looks like:

weirdo_cpp:

...

align 16
.top:
    mov    [rdx], eax  ; stride 16
    mov    [rsi], eax  ; never changes

    add    rdx, 16

    dec    rdi
    jne    .top

    ret

When the memory region accessed is in L2 I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing in a line from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store or (b) receive a line from L2 or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.

Indeed, if you comment out one store you get ~1.25 cycles per iteration for the first store alone, and ~1.01 cycles per iteration for the second store alone, so 2.5 cycles per iteration seems like a conservative estimate.

The actual performance is very odd, however. Here's a typical run of the test harness:

Estimated CPU speed:  2.60 GHz
output size     :   64 KiB
output alignment:   32
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.89 cycles/iter,  1.49 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 4.73 cycles/iter,  1.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.34 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.26 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.31 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.27 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0

Two things are weird here.

First are the bimodal timings: there is a fast mode and a slow mode. We start out in slow mode taking about 7.3 cycles per iteration, and at some point transition to about 3.9 cycles per iteration. This behavior is consistent and reproducible, and the timings always cluster tightly around the two values. The transition shows up in both directions, from slow mode to fast mode and the other way around (and sometimes there are multiple transitions in one run).

The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles, the performance is much worse than the 1.0 + 1.3 = 2.3 cycles worst case you'd expect from adding together each of the single-store cases (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).

Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the observed throughput here (about 2 bytes/cycle in slow mode).

What explains the poor throughput and bimodal performance and can I avoid it?

I'm also curious if this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.

You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly4 - if you don't have that you can just try the C++ version.
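For orientation, here is a minimal sketch of the kind of timing harness that produces numbers like those above. It is a hypothetical stand-in for the actual harness on github (which also estimates the CPU frequency and prints the extra columns), assuming TurboBoost is off so rdtsc reference cycles track core cycles; the constants are taken from the run shown above.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <x86intrin.h>   // __rdtsc

void weirdo_cpp(size_t iters, uint32_t* output);   // the loop under test (above)

int main() {
    const size_t BUF_BYTES = 64 * 1024;        // "output size: 64 KiB"
    const size_t ITERS     = BUF_BYTES / 16;   // stride-16 stores that fit the buffer
    const int    REPEATS   = 1000;

    // "output alignment: 32"
    uint32_t* output = static_cast<uint32_t*>(aligned_alloc(32, BUF_BYTES));

    for (int sample = 0; sample < 20; sample++) {
        uint64_t start = __rdtsc();
        for (int r = 0; r < REPEATS; r++)
            weirdo_cpp(ITERS, output);
        uint64_t cycles = __rdtsc() - start;
        printf("%5.2f cycles/iter\n", (double)cycles / ((double)REPEATS * ITERS));
    }
    free(output);
}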

Eliminated Possibilities

Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things that are default elimination there is often another argument to be made).

  • Alignment factors: the output array is 16 byte aligned, and I've tried up to 2MB alignment without change. Also eliminated by the default elimination.
  • Contention with other processes on the machine: the effect is observed more or less identically on an idle machine and even on a heavily loaded one (e.g., using stress -vm 4). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
  • TurboBoost: TurboBoost is completely disabled, confirmed by three different MHz readings.
  • Power-saving stuff: The performance governor is intel_pstate in performance mode. No frequency variations are observed during the test (CPU stays essentially locked at 2.59 GHz).
  • TLB effects: The effect is present even when the output buffer is located in a 2 MB huge page. In any case, the 64 4k TLB entries more than cover the 128K output buffer. perf doesn't report any particularly weird TLB behavior.
  • 4k aliasing: older, more complex versions of this benchmark did show some 4k aliasing but this has been eliminated since there are no loads in the benchmark (it's loads that might incorrectly alias earlier stores). Also eliminated by the default elimination.
  • L2 associativity conflicts: eliminated by the default elimination and by the fact that this doesn't go away even with 2MB pages, where we can be sure the output buffer is laid out linearly in physical memory.
  • Hyperthreading effects: HT is disabled.
  • Prefetching: Only two of the prefetchers could be involved here (the "DCU", aka L1<->L2 prefetchers), since all the data lives in L1 or L2, but the performance is the same with all prefetchers enabled or all disabled.
  • Interrupts: no correlation between interrupt count and slow mode. There is a limited number of total interrupts, mostly clock ticks.

toplev.py

I used toplev.py which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:

BE             Backend_Bound:                                                      82.11 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound:                                         59.64 % Slots      [  4.83%]
BE/Core        Backend_Bound.Core_Bound:                                           22.47 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                                 0.03 % Stalls     [  4.92%]
    This metric estimates how often the CPU was stalled without
    loads missing the L1 data cache...
    Sampling events:  mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                             74.91 % Stalls     [  4.96%] <==
    This metric estimates how often CPU was stalled  due to
    store memory accesses...
    Sampling events:  mem_inst_retired.all_stores:pp
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:                         28.20 % Clocks     [  4.93%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized:         26.28 % CoreClocks [  4.83%]
    This metric represents Core cycles fraction where the CPU
    executed total of 1 uop per cycle on all execution ports...
               MUX:                                                                 4.65 %           
    PerfMon Event Multiplexing accuracy indicator

This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.

Here's a reasonable summary of some of the issues involved in L1-L2 interaction.


Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the cases where the slow and very slow bimodal performance applied, i.e., with results around 16-20 cycles per line, like this:

This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2, shows the original behavior described in the question.


1 This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.

3 In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).

4 No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.

Solution

What I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and not at all for the bimodal distribution, but is more a set of rules for when you might see the performance and notes on mitigating it:

  • The store throughput into L2 appears to be at most one 64-byte cache-line per three cycles0, putting a ~21 bytes per cycle upper limit on store throughput. Said another way, series of stores that miss in L1 and hit in L2 will take at least three cycles per cache line touched.
  • Above that baseline there is a significant penalty when stores that hit in L2 are interleaved with stores to a different cache line (regardless of whether those stores hit in L1 or L2).
  • The penalty is apparently somewhat larger for stores that are nearby (but still not in the same cache line).
  • The bimodal performance is at least superficially related to the above effect, since it does not appear to occur in the non-interleaving case, although I don't have a further explanation for it.
  • If you ensure the cache line is already in L1 before the store, by prefetch or a dummy load, the slow performance disappears and the performance is no longer bimodal.

Details and Pictures

64-byte Stride

The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures that every store touches a new cache line (missing L1 and hitting L2) and so removes some variables.

Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:

top:
mov    BYTE PTR [rdx],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Running this in the same harness as above, I get about 3.05 cycles/store2, although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).
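If you don't have nasm or yasm handy, a C++ loop along these lines should compile to essentially the same inner loop. This is a sketch rather than the repo's actual C++ variant (the function name is mine), so it's worth checking that the generated assembly matches:

#include <cstddef>

// One byte store per 64-byte cache line, i.e. the stride-64 case above.
// volatile keeps the otherwise-dead stores inside the loop.
void store_stride64(size_t iters, char* output, char val) {
    volatile char* rdx = output;
    do {
        *rdx = val;            // mov BYTE PTR [rdx],al
        rdx += 64;             // add rdx,0x40
    } while (--iters > 0);     // sub rdi,0x1 / jne top
}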

So we already know we probably aren't going to do better than this for sustained stores purely to L21. While Skylake apparently has 64 bytes/cycle of throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared between evictions from L1 and loading new lines into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2, and (c) commit the store into L1.

What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:

top:
mov    BYTE PTR [rdx],al
mov    BYTE PTR [rdx+0x1],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Here's a histogram of the timing for 1000 runs of the test harness for the above loop:

  count   cycles/itr
      1   3.0
     51   3.1
      5   3.2
      5   3.3
     12   3.4
    733   3.5
    139   3.6
     22   3.7
      2   3.8
     11   4.0
     16   4.1
      1   4.3
      2   4.4

So the majority of times are clustered around 3.5 cycles. That means that this additional store only added about 0.5 cycles to the timing. It could be something like this: the store buffer is able to drain two stores to the L1 together if they are in the same line, but this only happens about half the time.

Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3 where 1 indicates the cache line: half of the positions have two consecutive stores to the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busy evicting lines to and accepting lines from L2, the L1 will become available for a store at an "arbitrary" point; if that point lands on a 1, 1 position maybe both stores drain in one cycle, but if it lands on 1, 2 it takes two cycles.

Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.
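To make the arithmetic behind those three peaks explicit, here is a tiny back-of-the-envelope calculation (an illustration of the hand-wavy model above, not a simulation of the hardware): the line transfer costs the ~3-cycle baseline, the extra same-line store costs nothing when the pair drains together and one extra cycle when it doesn't, and the common case is an even mix of the two phases.

#include <cstdio>

int main() {
    const double base    = 3.0;         // line eviction + fill + first store
    const double lucky   = base + 0.0;  // pair drains together   -> ~3.0-3.1 peak
    const double unlucky = base + 1.0;  // pair drains separately -> ~4.0-4.1 peak
    const double mixed   = 0.5 * lucky + 0.5 * unlucky;  // phase drifts -> ~3.5 peak
    printf("lucky %.1f, unlucky %.1f, mixed %.1f cycles/iter\n", lucky, unlucky, mixed);
}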

Let's test this theory by looking at various offsets between the first and second stores:

top:
mov    BYTE PTR [rdx + FIRST],al
mov    BYTE PTR [rdx + SECOND],al
add    rdx,0x40
sub    rdi,0x1
jne    top
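As a sketch, the same grid can be driven from C++ with runtime offsets instead of assembling one variant per (FIRST, SECOND) pair. This is hypothetical (the measurements below use the asm template, and the function names store_pair and sweep are mine); note the runtime offsets add a little address arithmetic to the loop:

#include <cstddef>

// Two byte stores per iteration at the given offsets, advancing one cache
// line per iteration. The output buffer needs ~256 bytes of slack at the end
// since the offsets can reach past the last line touched by rdx.
void store_pair(size_t iters, char* output, size_t first, size_t second, char val) {
    volatile char* rdx = output;
    do {
        rdx[first]  = val;   // mov BYTE PTR [rdx + FIRST],al
        rdx[second] = val;   // mov BYTE PTR [rdx + SECOND],al
        rdx += 64;           // add rdx,0x40
    } while (--iters > 0);
}

// The grid from the heat map: FIRST and SECOND from 0 to 256 in steps of 8
// (timing of each point omitted; see the harness sketch earlier).
void sweep(size_t iters, char* output, char val) {
    for (size_t first = 0; first <= 256; first += 8)
        for (size_t second = 0; second <= 256; second += 8)
            store_pair(iters, output, first, second, val);
}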

We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:

We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:

  • From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line as, or the next one after, the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line also works is that the pattern ends up being the same, except for the very first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - in the second case it is the second store that first touches each cache line, but the store buffer doesn't care. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?

  • The purple "outliers" never appear in the white areas, so they are apparently restricted to scenarios that are already slow (and the slow mode here makes them about 2.5x slower: from ~8 to 18 cycles).

We can zoom out a bit and look at even larger offsets:

The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away from the first one (ahead or behind), up until it gets worse again at an offset of about 1700 bytes. Even in the improved area we only reach 5.8 cycles/iteration at best, still much worse than the same-line performance of 3.5.

If you add any kind of load or prefetch instruction that runs ahead3 of the stores, both the overall slow performance and the "slow mode" outliers disappear:

You can port this back to the original stride-16 problem: any type of prefetch or load in the core loop, pretty much regardless of the distance (even if it's actually behind), fixes the issue, and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0 and equal to the sum of the two stores measured in separate loops.
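As a concrete sketch of that fix applied to the original loop (the function name and the choice of _mm_prefetch with _MM_HINT_T0 are mine; per the above, a dummy volatile load of the target line should work just as well):

#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>   // _mm_prefetch

volatile uint32_t value;

void weirdo_cpp_prefetch(size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t          *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        _mm_prefetch((const char*)rdx, _MM_HINT_T0);  // get the line into L1 before storing
        *rdx = x;
        *rsi = x;

        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}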

So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.

I don't really have a fleshed out explanation, but it could include these factors:

  • Having other stores in the store buffer may reduce the concurrency of the requests going to L2. It isn't clear exactly when stores that are going to miss in L1 allocate a store buffer entry, but perhaps it occurs near when the store is going to retire and there is a certain amount of "lookahead" into the store buffer to bring locations into L1, so having additional stores that aren't going to miss in L1 hurts the concurrency, since the lookahead can't see as many requests that will miss.
  • Perhaps there are conflicts for L1 and L2 resources, like read and write ports or inter-cache bandwidth, that are worse with this pattern of stores. For example, when stores to different lines interleave, maybe they cannot drain as quickly from the store queue (see above, where it appears that in some scenarios more than one store may drain per cycle).

These comments by Dr. McCalpin on the Intel forums are also quite interesting.


0 Mostly only achievable with the L2 streamer disabled since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.

1 Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel) then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.

2 With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (with prefetching on you see about 3.3 cycles per store, but with lots of variability).

3 It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetch/loads just quickly run ahead of the stores which are bottlenecked so they get ahead anyways. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.
