Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake
Question
I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location¹, like this:
volatile uint32_t value;

void weirdo_cpp(size_t iters, uint32_t* output) {
    uint32_t x = value;
    uint32_t  *rdx = output;
    volatile uint32_t *rsi = output;
    do {
        *rdx = x;
        *rsi = x;
        rdx += 4;  // 16 byte stride
    } while (--iters > 0);
}
In assembly this loop probably³ looks like:
weirdo_cpp:
...
align 16
.top:
mov [rdx], eax ; stride 16
mov [rsi], eax ; never changes
add rdx, 16
dec rdi
jne .top
ret
When the region of memory being accessed is in L2, I would expect this to run at less than 3 cycles per iteration. The second store just keeps hitting the same location and should add about a cycle. The first store implies bringing a line in from L2 and hence also evicting a line once every 4 iterations. I'm not sure how you evaluate the L2 cost, but even if you conservatively estimate that the L1 can only do one of the following every cycle: (a) commit a store or (b) receive a line from L2 or (c) evict a line to L2, you'd get something like 1 + 0.25 + 0.25 = 1.5 cycles for the stride-16 store stream.

Indeed, if you comment out one store or the other, you get ~1.25 cycles per iteration for the first store only, and ~1.01 cycles per iteration for the second store, so 2.5 cycles per iteration seems like a conservative estimate.

The actual performance is very odd, however. Here's a typical run of the test harness:
Estimated CPU speed: 2.60 GHz
output size : 64 KiB
output alignment: 32
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
3.89 cycles/iter, 1.49 ns/iter, cpu before: 0, cpu after: 0
3.90 cycles/iter, 1.50 ns/iter, cpu before: 0, cpu after: 0
4.73 cycles/iter, 1.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.33 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.34 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.26 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.31 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.29 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.27 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.30 cycles/iter, 2.81 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
7.28 cycles/iter, 2.80 ns/iter, cpu before: 0, cpu after: 0
Two things are weird here.

First are the bimodal timings: there is a fast mode and a slow mode. In the run shown we start out in fast mode at about 3.9 cycles per iteration, and at some point transition to slow mode at about 7.3 cycles per iteration. This behavior is consistent and reproducible, and the timings are always tightly clustered around the two values. The transition shows up in both directions, from slow mode to fast mode and the other way around (and sometimes there are multiple transitions in one run).

The other weird thing is the really bad performance. Even in fast mode, at about 3.9 cycles, the performance is much worse than the ~2.3 cycles you'd expect from simply adding together the two single-store cases (and assuming that absolutely zero work can be overlapped when both stores are in the loop). In slow mode, performance is terrible compared to what you'd expect based on first principles: it is taking 7.3 cycles to do 2 stores, and if you put it in L2 store bandwidth terms, that's roughly 29 cycles per L2 store (since we only store one full cache line every 4 iterations).

Skylake is recorded as having a 64B/cycle throughput between L1 and L2, which is way higher than the throughput observed here (about 2 bytes/cycle in slow mode).

What explains the poor throughput and the bimodal performance, and can I avoid it?

I'm also curious if this reproduces on other architectures and even on other Skylake boxes. Feel free to include local results in the comments.
You can find the test code and harness on github. There is a Makefile for Linux or Unix-like platforms, but it should be relatively easy to build on Windows too. If you want to run the asm variant you'll need nasm or yasm for the assembly⁴ - if you don't have that, you can just try the C++ version.
Eliminated Possibilities

Here are some possibilities that I considered and largely eliminated. Many of the possibilities are eliminated by the simple fact that you see the performance transition randomly in the middle of the benchmarking loop, when many things simply haven't changed (e.g., if it was related to the output array alignment, it couldn't change in the middle of a run since the same buffer is used the entire time). I'll refer to this as the default elimination below (even for things that are subject to the default elimination there is often another argument to be made).
- Contention from elsewhere on the box: more or less the same effect is observed on an idle machine and under load (e.g., with stress -vm 4 running). The benchmark itself should be completely core-local anyways since it fits in L2, and perf confirms there are very few L2 misses per iteration (about 1 miss every 300-400 iterations, probably related to the printf code).
- Frequency scaling: the frequency governor is intel_pstate in performance mode. No frequency variations are observed during the test (the CPU stays essentially locked at 2.59 GHz).
- TLB effects: perf doesn't report any particularly weird TLB behavior.

I used toplev.py, which implements Intel's Top Down analysis method, and to no surprise it identifies the benchmark as store bound:
BE Backend_Bound: 82.11 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound: 59.64 % Slots [ 4.83%]
BE/Core Backend_Bound.Core_Bound: 22.47 % Slots [ 4.83%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 0.03 % Stalls [ 4.92%]
This metric estimates how often the CPU was stalled without
loads missing the L1 data cache...
Sampling events: mem_load_retired.l1_hit:pp mem_load_retired.fb_hit:pp
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 74.91 % Stalls [ 4.96%] <==
This metric estimates how often CPU was stalled due to
store memory accesses...
Sampling events: mem_inst_retired.all_stores:pp
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.20 % Clocks [ 4.93%]
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized: 26.28 % CoreClocks [ 4.83%]
This metric represents Core cycles fraction where the CPU
executed total of 1 uop per cycle on all execution ports...
MUX: 4.65 %
PerfMon Event Multiplexing accuracy indicator
This doesn't really shed much light: we already knew it must be the stores messing things up, but why? Intel's description of the condition doesn't say much.

Here's a reasonable summary of some of the issues involved in the L1-L2 interaction.

Update Feb 2019: I can no longer reproduce the "bimodal" part of the performance: for me, on the same i7-6700HQ box, the performance is now always very slow in the cases where the slow and very slow bimodal performance used to apply, i.e., with results around 16-20 cycles per line. This change seems to have been introduced in the August 2018 Skylake microcode update, revision 0xC6. The prior microcode, 0xC2, shows the original behavior described in the question.

1 This is a greatly simplified MCVE of my original loop, which was at least 3 times the size and which did lots of additional work, but exhibited exactly the same performance as this simple version, bottlenecked on the same mysterious issue.

3 In particular, it looks exactly like this if you write the assembly by hand, or if you compile it with gcc -O1 (version 5.4.1), and probably with most reasonable compilers (volatile is used to avoid sinking the mostly-dead second store outside the loop).

4 No doubt you could convert this to MASM syntax with a few minor edits since the assembly is so trivial. Pull requests accepted.
Answer

Here's what I've found so far. Unfortunately it doesn't really offer an explanation for the poor performance, and not at all for the bimodal distribution, but is more a set of rules for when you might see the performance and notes on mitigating it.

Details and Pictures

64-byte Stride

The original question arbitrarily used a stride of 16, but let's start with probably the simplest case: a stride of 64, i.e., one full cache line. As it turns out the various effects are visible with any stride, but 64 ensures that every store misses L1 and goes to L2, and so removes some variables.

Let's also remove the second store for now - so we're just testing a single 64-byte strided store over 64K of memory:
top:
mov BYTE PTR [rdx],al
add rdx,0x40
sub rdi,0x1
jne top
Running this in the same harness as above, I get about 3.05 cycles/store², although there is quite a bit of variance compared to what I'm used to seeing (you can even find a 3.0 in there).

So we know already that we probably aren't going to do better than this for sustained stores purely to L2¹. While Skylake apparently has a 64-byte throughput between L1 and L2, in the case of a stream of stores that bandwidth has to be shared between evictions from L1 and loading new lines into L1. 3 cycles seems reasonable if it takes, say, 1 cycle each to (a) evict the dirty victim line from L1 to L2, (b) update L1 with the new line from L2 and (c) commit the store into L1.

What happens when you add a second write to the same cache line (to the next byte, although it turns out not to matter) in the loop? Like this:
top:
mov BYTE PTR [rdx],al
mov BYTE PTR [rdx+0x1],al
add rdx,0x40
sub rdi,0x1
jne top
Here's a histogram of the timing for 1000 runs of the test harness for the above loop:
count cycles/itr
1 3.0
51 3.1
5 3.2
5 3.3
12 3.4
733 3.5
139 3.6
22 3.7
2 3.8
11 4.0
16 4.1
1 4.3
2 4.4
So the majority of times are clustered around 3.5 cycles. That means this additional store only added 0.5 cycles to the timing. It could be something like the store buffer being able to drain two stores to the L1 in one cycle if they are in the same line, but this only happens about half the time.

Consider that the store buffer contains a series of stores like 1, 1, 2, 2, 3, 3, where 1 indicates the cache line: half of the positions have two consecutive values from the same cache line and half don't. As the store buffer is waiting to drain stores, and the L1 is busily evicting lines to and accepting lines from L2, the L1 will become available for a store at an "arbitrary" point, and if it is at a 1, 1 position maybe the stores drain in one cycle, but if it's at 1, 2 it takes two cycles.

Note there is another peak of about 6% of results around 3.1 rather than 3.5. That could be a steady state where we always get the lucky outcome. There is another peak of around 3% at ~4.0-4.1 - the "always unlucky" arrangement.

Let's test this theory by looking at various offsets between the first and second stores:

top:
mov BYTE PTR [rdx + FIRST],al
mov BYTE PTR [rdx + SECOND],al
add rdx,0x40
sub rdi,0x1
jne top
We try all values of FIRST and SECOND from 0 to 256 in steps of 8. The results, with varying FIRST values on the vertical axis and SECOND on the horizontal:

We see a specific pattern - the white values are "fast" (around the 3.0-4.1 values discussed above for the offset of 1). Yellow values are higher, up to 8 cycles, and red up to 10. The purple outliers are the highest and are usually cases where the "slow mode" described in the OP kicks in (usually clocking in at 18.0 cycles/iter). We notice the following:

From the pattern of white cells, we see that we get the fast ~3.5 cycle result as long as the second store is in the same cache line or the next one relative to the first store. This is consistent with the idea above that stores to the same cache line are handled more efficiently. The reason that having the second store in the next cache line also works is that the pattern ends up being the same, except for the very first access:
0, 0, 1, 1, 2, 2, ...
vs 0, 1, 1, 2, 2, ...
- where in the second case it is the second store that first touches each cache line. The store buffer doesn't care though. As soon as you get into different cache lines, you get a pattern like 0, 2, 1, 3, 2, ..., and apparently this sucks?

The purple "outliers" never appear in the white areas, so they are apparently restricted to the scenario that is already slow (and the slow mode here makes it about 2.5x slower: from ~8 to 18 cycles).

We can zoom out a bit and look at even larger offsets:

The same basic pattern holds, although we see that the performance improves (green area) as the second store gets further away (ahead or behind) from the first one, up until it gets worse again at an offset of about ~1700 bytes. Even in the improved area we only get to at best 5.8 cycles/iteration, still much worse than the same-line performance of 3.5.

If you add any kind of load or prefetch instruction that runs ahead³ of the stores, both the overall slow performance and the "slow mode" outliers disappear:

You can port this back to the original stride-16 problem - any type of prefetch or load in the core loop, pretty much insensitive to the distance (even if it's behind, in fact), fixes the issue and you get 2.3 cycles/iteration, close to the best possible ideal of 2.0, and equal to the sum of the two stores with separate loops.

So the basic rule is that stores to L2 without corresponding loads are much slower than if you software prefetch them - unless the entire store stream accesses cache lines in a single sequential pattern. That's contrary to the idea that a linear pattern like this never benefits from SW prefetch.

I don't really have a fleshed-out explanation, but it could include several factors. These comments by Dr. McCalpin on the Intel forums are also quite interesting.

0 Mostly only achievable with the L2 streamer disabled, since otherwise the additional contention on the L2 slows this down to about 1 line per 3.5 cycles.

1 Contrast this with loads, where I get almost exactly 1.5 cycles per load, for an implied bandwidth of ~43 bytes per cycle. This makes perfect sense: the L1<->L2 bandwidth is 64 bytes, but assuming that the L1 is either accepting a line from the L2 or servicing load requests from the core every cycle (but not both in parallel), then you have 3 cycles for two loads to different L2 lines: 2 cycles to accept the lines from L2, and 1 cycle to satisfy two load instructions.

2 With prefetching off. As it turns out, the L2 prefetcher competes for access to the L2 cache when it detects streaming access: even though it always finds the candidate lines and doesn't go to L3, this slows down the code and increases variability. The conclusions generally hold with prefetching on, but everything is just a bit slower (here's a big blob of results with prefetching on - you see about 3.3 cycles per load, but with lots of variability).

3 It doesn't even really need to be ahead - prefetching several lines behind also works: I guess the prefetch/loads just quickly run ahead of the stores, which are bottlenecked, so they get ahead anyways. In this way the prefetching is kind of self-healing and seems to work with almost any value you put in.