32-byte aligned routine does not fit the uops cache
KbL i7-8550U
I'm researching the behavior of uops-cache and came across a misunderstanding regarding it.
As specified in the Intel Optimization Manual 2.5.2.2
(emphasis mine):
The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.
-
All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
-
Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.
-
A non-conditional branch is the last micro-op in a Way.
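These packing rules can be turned into a tiny model. The sketch below is my own simplification, not Intel's documented allocation algorithm: it packs the uop stream of one 32-byte region into ways of at most six uops, terminating a way at every unconditional branch, which is enough to reason about the cases in this question.

```python
def ways_needed(uops):
    """Pack uops (list of mnemonic strings) from one 32-byte region into
    DSB ways: each way holds at most six uops, and an unconditional
    branch (jmp/ret/call) is always the last uop in its way."""
    UNCONDITIONAL = {"jmp", "ret", "call"}
    ways, in_way = 0, 0
    for uop in uops:
        if in_way == 0:
            ways += 1          # open a new way
        in_way += 1
        if uop in UNCONDITIONAL or in_way == 6:
            in_way = 0         # way is terminated
    return ways

# CASE 1 below: the whole routine lands in one 32-byte region.
region = ["mov"] * 6 + ["jmp", "dec", "ja", "ret"]
print(ways_needed(region))        # 3 ways: exactly at the 3-way limit
print(ways_needed(region) <= 3)   # True: cacheable under this model
```

Under this model the CASE 1 routine packs into exactly three ways, right at the 3-ways-per-region limit, so one would expect it to be DSB-cacheable.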
CASE 1:
Consider the following routine:
uop.h
void inhibit_uops_cache(size_t);
uop.S
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
To make sure that the routine's code is actually 32-byte aligned, here is the disassembly:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> mov edx,esi
0x55555555482c <inhibit_uops_cache+12> jmp 0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt> dec rdi
0x555555554831 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5> ret
0x555555554834 <decrement_jmp_tgt+6> nop
0x555555554835 <decrement_jmp_tgt+7> nop
0x555555554836 <decrement_jmp_tgt+8> nop
0x555555554837 <decrement_jmp_tgt+9> nop
0x555555554838 <decrement_jmp_tgt+10> nop
0x555555554839 <decrement_jmp_tgt+11> nop
0x55555555483a <decrement_jmp_tgt+12> nop
0x55555555483b <decrement_jmp_tgt+13> nop
0x55555555483c <decrement_jmp_tgt+14> nop
0x55555555483d <decrement_jmp_tgt+15> nop
0x55555555483e <decrement_jmp_tgt+16> nop
0x55555555483f <decrement_jmp_tgt+17> nop
running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
6 431 201 748 idq.dsb_cycles (56,91%)
19 175 741 518 idq.dsb_uops (57,13%)
7 866 687 idq.mite_uops (57,36%)
3 954 421 idq.ms_uops (57,46%)
560 459 dsb2mite_switches.penalty_cycles (57,28%)
884 486 frontend_retired.dsb_miss (57,05%)
6 782 598 787 cycles (56,82%)
1,749000366 seconds time elapsed
1,748985000 seconds user
0,000000000 seconds sys
This is exactly what I expected to get.
The vast majority of uops came from the uops cache, and the uop count perfectly matches my expectation:
mov edx, esi - 1 uop;
jmp imm near - 1 uop;
dec rdi - 1 uop;
ja imm near - 1 uop
4096 * 4096 * 128 * 9 = 19 327 352 832
approximately equal to the counters 19 326 755 442 + 3 836 395 + 1 642 975
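The arithmetic can be checked directly (a trivial sketch; the counter values are the ones quoted just above):

```python
iterations = 4096 * 4096 * 128      # trip count passed to the routine
uops_per_iteration = 9              # 6x mov + jmp + dec + ja
expected = iterations * uops_per_iteration
print(expected)                     # 19327352832

# Sum of the delivery counters quoted above:
measured = 19_326_755_442 + 3_836_395 + 1_642_975
print(abs(measured - expected) / expected)   # relative error well under 0.1%
```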
CASE 2:
Consider the implementation of inhibit_uops_cache, which differs by one instruction being commented out:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5> ret
0x555555554832 <decrement_jmp_tgt+6> nop
0x555555554833 <decrement_jmp_tgt+7> nop
0x555555554834 <decrement_jmp_tgt+8> nop
0x555555554835 <decrement_jmp_tgt+9> nop
0x555555554836 <decrement_jmp_tgt+10> nop
0x555555554837 <decrement_jmp_tgt+11> nop
0x555555554838 <decrement_jmp_tgt+12> nop
0x555555554839 <decrement_jmp_tgt+13> nop
0x55555555483a <decrement_jmp_tgt+14> nop
0x55555555483b <decrement_jmp_tgt+15> nop
0x55555555483c <decrement_jmp_tgt+16> nop
0x55555555483d <decrement_jmp_tgt+17> nop
0x55555555483e <decrement_jmp_tgt+18> nop
0x55555555483f <decrement_jmp_tgt+19> nop
running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
2 464 970 970 idq.dsb_cycles (56,93%)
6 197 024 207 idq.dsb_uops (57,01%)
10 845 763 859 idq.mite_uops (57,19%)
3 022 089 idq.ms_uops (57,38%)
321 614 dsb2mite_switches.penalty_cycles (57,35%)
1 733 465 236 frontend_retired.dsb_miss (57,16%)
8 405 643 642 cycles (56,97%)
2,117538141 seconds time elapsed
2,117511000 seconds user
0,000000000 seconds sys
The counters are completely unexpected.
I expected all the uops to come from the DSB as before, since the routine meets the requirements of the uops cache.
By contrast, nearly two-thirds of the uops came from the Legacy Decode Pipeline.
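As a sanity check on that split, the shares implied by the counters above can be computed (a throwaway sketch; values copied from the perf output):

```python
dsb  = 6_197_024_207    # idq.dsb_uops
mite = 10_845_763_859   # idq.mite_uops
ms   = 3_022_089        # idq.ms_uops
total = dsb + mite + ms
print(f"MITE share: {mite / total:.1%}")   # roughly 64%
print(f"DSB share:  {dsb / total:.1%}")
```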
QUESTION: What's wrong with the CASE 2? What counters to look at to understand what's going on?
UPD: Following @PeterCordes' idea, I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:
CASE 3:
Aligning the unconditional jump target to 32 bytes as follows:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt> dec rdi
0x555555554843 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5> ret
and running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the following counters
Performance counter stats for './bin':
4 296 298 295 idq.dsb_cycles (57,19%)
17 145 751 147 idq.dsb_uops (57,32%)
45 834 799 idq.mite_uops (57,32%)
1 896 769 idq.ms_uops (57,32%)
136 865 dsb2mite_switches.penalty_cycles (57,04%)
161 314 frontend_retired.dsb_miss (56,90%)
4 319 137 397 cycles (56,91%)
1,096792233 seconds time elapsed
1,096759000 seconds user
0,000000000 seconds sys
The result is perfectly expected. More than 99% of the uops came from the dsb.
Avg dsb uops delivery rate = 17 145 751 147 / 4 296 298 295
= 3.99
which is close to the peak bandwidth.
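This ratio can be checked in a couple of lines (counter values copied from the perf output above):

```python
dsb_uops   = 17_145_751_147   # idq.dsb_uops
dsb_cycles =  4_296_298_295   # idq.dsb_cycles
rate = dsb_uops / dsb_cycles
print(f"{rate:.2f} uops per DSB cycle")   # 3.99, near the 4-uop peak
```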
(This is not the answer to the OP's problem, but it is one to watch out for.)
See Code alignment dramatically affects performance for compiler options that work around this performance pothole Intel introduced into Skylake-derived CPUs as part of the JCC-erratum microcode workaround.
Other observations: the block of 6 mov
instructions should fill a uop cache line, with jmp
in a line by itself. In case 2, the 5 mov
+ jmp
should fit in one cache line (or more properly "way").
(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30
is not a 32-byte boundary, only 0x...20
and 40
, so this erratum shouldn't be the problem for the code in the question.)
A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures. (KBL142 on your Kaby-Lake specifically).
Microcode Update (MCU) to Mitigate JCC Erratum
This erratum can be prevented by a microcode update (MCU). The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.
Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (PDF screenshot borrowed from a Phoronix article with benchmarks before/after, and after with rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall).
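The trigger condition quoted above is easy to express as code. The sketch below is my own helper (the name `jcc_erratum_hit` is not from Intel's whitepaper); "ends on a 32-byte boundary" is taken to mean the jump's last byte is the last byte of an aligned 32-byte chunk:

```python
def jcc_erratum_hit(start, length):
    """True if a jump at address `start` spanning `length` bytes crosses a
    32-byte boundary or ends on one, and thus cannot be cached in the
    DSB after the microcode update."""
    last = start + length - 1
    crosses = (start // 32) != (last // 32)
    ends_on_boundary = (last + 1) % 32 == 0
    return crosses or ends_on_boundary

# The 2-byte ja at 0x...82f ends at 0x...830, which is only a 16-byte
# boundary, so this jump is NOT affected:
print(jcc_erratum_hit(0x55555555482F, 2))   # False

# A 2-byte jcc whose last byte lands at offset 0x3f WOULD be affected:
print(jcc_erratum_hit(0x55555555483E, 2))   # True (ends on the boundary)
```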
The last byte of the ja in your code is ...30
, so it's the culprit.
If this was a 32-byte boundary, not just 16, then we'd have the problem here:
0x55555555482a <inhibit_uops_cache+10> jmp # fine
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5> ret # fine
(This section was not fully updated; it still talks about spanning a 32-byte boundary.)
JA itself spans a boundary.
Inserting a NOP after dec rdi
should work, putting the 2-byte ja
fully after the boundary with a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.
Using sub rdi, 1
to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that instruction would still span the boundary.
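Keeping this section's premise of treating the 16-byte boundary at 0x...830 as the problematic one, the two proposed layouts can be compared numerically. This is a hypothetical sketch with a generic alignment parameter; the addresses are taken from the case-2 disassembly above:

```python
def spans_boundary(start, length, align):
    # True if the byte range [start, start + length) crosses an
    # `align`-byte boundary
    return start // align != (start + length - 1) // align

dec_start = 0x55555555482C              # dec rdi, 3 bytes

# Original layout: 2-byte ja at 0x...82f spans the boundary at 0x...830
print(spans_boundary(dec_start + 3, 2, 16))   # True

# With a 1-byte NOP after dec, ja starts at 0x...830, fully inside the
# next chunk:
print(spans_boundary(dec_start + 4, 2, 16))   # False

# With sub rdi,1 (4 bytes) macro-fused with ja (2 bytes), the combined
# 6-byte region starting at 0x...82c still spans 0x...830:
print(spans_boundary(dec_start, 6, 16))       # True
```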
You could use single-byte nops instead of mov
before the jmp
to move everything earlier, if that gets it all in before the last byte of a block.
ASLR can change what virtual page code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in disassembly in one case will happen every time.
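A sketch of why ASLR cannot change this: pages are 4 KiB, so any ASLR base is a multiple of 4096, and an instruction's address modulo 32 (or modulo 64, the cache-line size) depends only on its fixed offset within the page. The base addresses below are arbitrary page-aligned examples:

```python
page_offset = 0x820 & 0xFFF        # fixed in-page offset of inhibit_uops_cache
for aslr_base in (0x555555554000, 0x7F0123456000, 0x400000):
    addr = aslr_base + page_offset
    # page bases are 4096-aligned, so addr % 32 never changes
    print(hex(addr), addr % 32)    # offset within a 32-byte chunk: always 0
```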