Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake


Question


I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 byte and one that's always to the same location1, like this:

volatile uint32_t value;

void weirdo_cpp(size_t iters, uint32_t* output) {

    uint32_t x = value;
    uint32_t          *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        *rdx    = x;
        *rsi = x;

        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}

In assembly this loop probably3 looks like:

weirdo_cpp:

...

align 16
.top:
    mov    [rdx], eax  ; stride 16
    mov    [rsi], eax  ; never changes

    add    rdx, 16

    dec    rdi
    jne    .top

    ret

When the memory region accessed is in L2 I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing in a line from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store or (b) receive a line from L2 or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.

Indeed, if you comment out one store you get ~1.25 cycles per iteration for the first store only, and ~1.01 cycles per iteration for the second store, so 2.5 cycles per iteration seems like a conservative estimate.
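
For concreteness, the two single-store baselines could look like this in C++ (my own sketch with hypothetical function names; the author presumably just commented one of the stores out of weirdo_cpp):

#include <cstddef>
#include <cstdint>

extern volatile uint32_t value;   // defined as in the question

// ~1.25 cycles/iter: only the 16-byte-stride store remains.
void stride_store_only(std::size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t* rdx = output;
    do {
        *rdx = x;
        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}

// ~1.01 cycles/iter: only the fixed-location store remains.
void same_location_only(std::size_t iters, uint32_t* output) {
    uint32_t x = value;
    volatile uint32_t* rsi = output;   // volatile keeps the repeated store from being elided
    do {
        *rsi = x;
    } while (--iters > 0);
}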

The actual performance is very odd, however. Here's a typical run of the test harness:

Estimated CPU speed:  2.60 GHz
output size     :   64 KiB
output alignment:   32
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 3.89 cycles/iter,  1.49 ns/iter, cpu before: 0, cpu after: 0
 3.90 cycles/iter,  1.50 ns/iter, cpu before: 0, cpu after: 0
 4.73 cycles/iter,  1.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.33 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.34 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.26 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.31 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.29 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.27 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.30 cycles/iter,  2.81 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0
 7.28 cycles/iter,  2.80 ns/iter, cpu before: 0, cpu after: 0

Two things are weird here.

First are the bimodal timings: there is a fast mode and a slow mode. We start out in slow mode taking about 7.3 cycles per iteration, and at some point transition to about 3.9 cycles per iteration. This behavior is consistent and reproducible and the two timings are always quite consistently clustered around the two values. The transition shows up in both directions, from slow mode to fast mode and the other way around (and sometimes there are multiple transitions in one run).

The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles the performance is much worse than the 1.0 + 1.3 = 2.3 cycles worst case you'd expect from adding together each of the cases with a single store (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).

Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the observed throughput here (about 2 bytes/cycle in slow mode).

What explains the poor throughput and bimodal performance and can I avoid it?

I'm also curious if this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.

You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly4 - if you don't have that you can just try the C++ version.
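
If you just want a rough local reproduction without the github harness, a minimal stand-in might look like the sketch below (my own code, not the author's; the 2.6 GHz constant is an assumed fixed core clock matching the "Estimated CPU speed" line above, so the cycle numbers are only meaningful with frequency scaling and turbo disabled):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

void weirdo_cpp(std::size_t iters, uint32_t* output);   // the loop from the question

int main() {
    const double ghz = 2.6;                    // assumed fixed core clock
    const std::size_t out_bytes = 64 * 1024;   // 64 KiB output region, as in the runs above
    const std::size_t iters = out_bytes / 16;  // the 16-byte stride covers the buffer once
    const std::size_t inner = 2000;            // repeat calls so each sample is measurable
    std::vector<uint32_t> out(out_bytes / sizeof(uint32_t));

    for (int rep = 0; rep < 20; ++rep) {
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < inner; ++i)
            weirdo_cpp(iters, out.data());
        auto t1 = std::chrono::steady_clock::now();
        double ns_per_iter =
            std::chrono::duration<double, std::nano>(t1 - t0).count() / (iters * inner);
        std::printf("%5.2f cycles/iter, %5.2f ns/iter\n", ns_per_iter * ghz, ns_per_iter);
    }
    return 0;
}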

Eliminated Possibilities

Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things that are default elimination there is often another argument to be made).

  • Alignment factors: the output array is 16 byte aligned, and I've tried up to 2MB alignment without change. Also eliminated by the default elimination.
  • Contention with other processes on the machine: the effect is observed more or less identically on an idle machine and even on a heavily loaded one (e.g., using stress -vm 4). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
  • TurboBoost: TurboBoost is completely disabled, confirmed by three different MHz readings.
  • Power-saving stuff: The performance governor is intel_pstate in performance mode. No frequency variations are observed during the test (CPU stays essentially locked at 2.59 GHz).
  • TLB effects: The effect is present even when the output buffer is located in a 2 MB huge page. In any case, the 64 4k TLB entries more than cover the 128K output buffer. perf doesn't report any particularly weird TLB behavior.
  • 4k aliasing: older, more complex versions of this benchmark did show some 4k aliasing but this has been eliminated since there are no loads in the benchmark (it's loads that might incorrectly alias earlier stores). Also eliminated by the default elimination.
  • L2 associativity conflicts: eliminated by the default elimination and by the fact that this doesn't go away even with 2MB pages, where we can be sure the output buffer is laid out linearly in physical memory.
  • Hyperthreading effects: HT is disabled.
  • Prefetching: Only two of the prefetchers could be involved here (the "DCU", aka L1<->L2 prefetchers), since all the data lives in L1 or L2, but the performance is the same with all prefetchers enabled or all disabled.
  • Interrupts: no correlation between interrupt count and slow mode. There is a limited number of total interrupts, mostly clock ticks.

toplev.py

I used toplev.py which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:

BE             Backend_Bound:                                                      82.11 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound:                                         59.64 % Slots      [  4.83%]
BE/Core        Backend_Bound.Core_Bound:                                           22.47 % Slots      [  4.83%]
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                                 0.03 % Stalls     [  4.92%]
    This metric estimates how often the CPU was stalled without
    loads missing the L1 data cache...
    Sampling events:  mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                             74.91 % Stalls     [  4.96%] <==
    This metric estimates how often CPU was stalled  due to
    store memory accesses...
    Sampling events:  mem_inst_retired.all_stores:pp
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:                         28.20 % Clocks     [  4.93%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized:         26.28 % CoreClocks [  4.83%]
    This metric represents Core cycles fraction where the CPU
    executed total of 1 uop per cycle on all execution ports...
               MUX:                                                                 4.65 %           
    PerfMon Event Multiplexing accuracy indicator

This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.

Here's a reasonable summary of some of the issues involved in L1-L2 interaction.


Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the cases where the slow and very-slow bimodal performance used to apply, i.e., with results around 16-20 cycles per line.

This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2 shows the original behavior described in the question.


1 This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.

3 In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).

4 No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.

Solution

What I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and not at all for the bimodal distribution, but is more a set of rules for when you might see the poor performance and notes on mitigating it:

  • The store throughput into L2 appears to be at most one 64-byte cache-line per three cycles0, putting a ~21 bytes per cycle upper limit on store throughput. Said another way, a series of stores that miss in L1 and hit in L2 will take at least three cycles per cache line touched.
  • Above that baseline there is a significant penalty when stores that hit in L2 are interleaved with stores to a different cache line (regardless of whether those stores hit in L1 or L2).
  • The penalty is apparently somewhat larger for stores that are nearby (but still not in the same cache line).
  • The bimodal performance is at least superficially related to the above effect since in the non-interleaving case it does not appear to occur, although I don't have a further explanation for it.
  • If you ensure the cache line is already in L1 before the store, by prefetch or a dummy load, the slow performance disappears and the performance is no longer bimodal.

Details and Pictures

64-byte Stride

The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures an L2 cache miss on every stride and so removes some variables.

Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:

top:
mov    BYTE PTR [rdx],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Running this in the same harness as above, I get about 3.05 cycles/store2, although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).

So we know already we probably aren't going to do better than this for sustained stores purely to L21. While Skylake apparently has 64 bytes/cycle of throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared between evictions from L1 and loading the new lines into L1. 3 cycles seems reasonable if it takes say 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2 and (c) commit the store into L1.

What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:

top:
mov    BYTE PTR [rdx],al
mov    BYTE PTR [rdx+0x1],al
add    rdx,0x40
sub    rdi,0x1
jne    top

Here's a histogram of the timing for 1000 runs of the test harness for the above loop:

  count   cycles/itr
      1   3.0
     51   3.1
      5   3.2
      5   3.3
     12   3.4
    733   3.5
    139   3.6
     22   3.7
      2   3.8
     11   4.0
     16   4.1
      1   4.3
      2   4.4

So the majority of times are clustered around 3.5 cycles. That means that this additional store only added 0.5 cycles to the timing. It could be something like the store buffer being able to drain two stores to the L1 together if they are in the same line, but this only happens about half the time.

Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3 where 1 indicates the cache line: half of the positions have two consecutive values from the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busily evicting to and accepting lines from L2, the L1 will come available for a store at an "arbitrary" point, and if it is at the position 1, 1 maybe the stores drain in one cycle, but if it's at 1, 2 it takes two cycles.

Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.
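
A toy expected-value check of that drain argument (my own illustration, not measured data): charge the 3-cycle single-store baseline per iteration, plus one extra cycle whenever the second store cannot drain together with the first. A 50/50 mix of lucky and unlucky drains gives the main 3.5 peak, while the always-lucky and always-unlucky corner cases line up with the ~3.1 and ~4.0-4.1 side peaks.

#include <cstdio>

int main() {
    const double baseline = 3.0;                       // cycles/iter for the single store
    const double p_lucky[] = {1.0, 0.5, 0.0};          // fraction of same-cycle drains
    for (double p : p_lucky) {
        double expected = baseline + (1.0 - p);        // +1 cycle for each unlucky drain
        std::printf("p(lucky) = %.1f  ->  %.1f cycles/iter\n", p, expected);
    }
    return 0;
}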

Let's test this theory by looking at various offsets between the first and second stores:

top:
mov    BYTE PTR [rdx + FIRST],al
mov    BYTE PTR [rdx + SECOND],al
add    rdx,0x40
sub    rdi,0x1
jne    top
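
For reference, a rough C++ analogue of that parametrized kernel could look like this (my own sketch with a hypothetical name; the author swept the asm directly, with FIRST and SECOND baked in as assembly-time constants):

#include <cstddef>
#include <cstdint>

// Offsets as template parameters so the compiler emits constant displacements,
// mirroring the immediate offsets in the asm above.
template <std::size_t FIRST, std::size_t SECOND>
void two_offset_stores(std::size_t iters, uint8_t* out) {
    volatile uint8_t* p = out;   // volatile keeps the byte stores from being optimized away
    do {
        p[FIRST]  = 1;           // first store
        p[SECOND] = 1;           // second store at the offset under test
        p += 64;                 // advance one full cache line per iteration
    } while (--iters > 0);
}

// e.g. two_offset_stores<0, 8>(iters, buf);  two_offset_stores<0, 72>(iters, buf);  ...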

We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:

We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:

  • From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line as the first store, or the next one. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line works is that the pattern ends up being the same, except for the first access: 0, 0, 1, 1, 2, 2, ... vs 0, 1, 1, 2, 2, ... - where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ... and apparently this sucks?

  • The purple "outliers" never appear in the white areas, so they are apparently restricted to the scenario that is already slow (and the slow mode here makes it about 2.5x slower: from ~8 to 18 cycles).

We can zoom out a bit and look at even larger offsets:

The same basic pattern, although we see that the performance improves (green area) as the second store gets further away from (ahead of or behind) the first one, up until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to at best 5.8 cycles/iteration, still much worse than the same-line performance of 3.5.

If you add any kind of load or prefetch instruction that runs ahead3 of the stores, both the overall slow performance and the "slow mode" outliers disappear:

You can port this back to the original stride-16 problem - any type of prefetch or load in the core loop, pretty much insensitive to the distance (even if it's behind, in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0, and equal to the sum of the two stores with separate loops.
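
Applied to the C++ loop from the question, the fix could look something like this (a sketch, not the author's exact code; per the text a dummy volatile load of *rdx works just as well, and the prefetch distance barely matters):

#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T0

extern volatile uint32_t value;   // defined as in the question

void weirdo_cpp_prefetch(std::size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t*          rdx = output;
    volatile uint32_t* rsi = output;
    do {
        // Touch the line the striding store is about to hit so it is already in L1.
        _mm_prefetch(reinterpret_cast<const char*>(rdx), _MM_HINT_T0);
        *rdx = x;
        *rsi = x;
        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}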

So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.

I don't really have a fleshed out explanation, but it could include these factors:

  • Having other stores in the store buffers may reduce the concurrency of the requests going to L2. It isn't clear exactly when stores that are going to miss in L1 allocate a store buffer, but perhaps it occurs near when the store is going to retire and there is a certain amount of "lookahead" into the store buffer to bring locations into L1, so having additional stores that aren't going to miss in L1 hurts the concurrency since the lookahead can't see as many requests that will miss.
  • Perhaps there are conflicts for L1 and L2 resources like read and write ports, inter-cache bandwidth, that are worse with this pattern of stores. For example when stores to different lines interleave, maybe they cannot drain as quickly from the store queue (see above where it appears that in some scenarios more than one store may drain per cycle).

These comments by Dr. McCalpin on the Intel forums are also quite interesting.


0 Mostly only achievable with the L2 streamer disabled since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.

1 Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel) then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.

2 With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per load, but with lots of variability).

3 It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetch/loads just quickly run ahead of the stores which are bottlenecked so they get ahead anyways. In this way, the prefetching is kind of self-healing and seems to work with almost any value you put in.
