32-byte aligned routine does not fit the uops cache
Problem description
KbL i7-8550U
I'm researching the behavior of the uops cache and came across something I don't understand.
As specified in the Intel Optimization Manual, section 2.5.2.2 (emphasis mine):
The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.
-
All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
-
Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.
-
A non-conditional branch is the last micro-op in a Way.
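The rules quoted above can be turned into a small counting model. The following is only a sketch under simplifying assumptions: every instruction is taken to decode to exactly one micro-op, only the three quoted rules are modelled, and the name ways_needed is hypothetical. It is not a description of how the hardware actually allocates Ways.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { UOPS_PER_WAY = 6, MAX_WAYS_PER_REGION = 3 };

/*
 * ends_way[i] is true when micro-op i is a non-conditional branch
 * (jmp, ret), which must be the last micro-op in its Way.
 * Returns the number of Ways the 32-byte region needs, or -1 when it
 * needs more than three and therefore cannot be cached under the
 * quoted rules.
 */
int ways_needed(const bool *ends_way, size_t n_uops)
{
    assert(ends_way != NULL || n_uops == 0);
    int ways = 0;
    int used = 0; /* micro-ops already placed in the currently open Way */
    for (size_t i = 0; i < n_uops; i++) {
        if (used == 0)
            ways++;          /* open a new Way */
        used++;
        if (used == UOPS_PER_WAY || ends_way[i])
            used = 0;        /* Way is full or ends with a branch: close it */
    }
    return ways <= MAX_WAYS_PER_REGION ? ways : -1;
}
```

Under these assumptions, a region holding six mov, jmp, dec, ja, ret needs three Ways, and one holding five mov, jmp, dec, ja, ret needs two, so both fit within the three-Way limit.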
Case 1:
Consider the following routine:
uop.h
#include <stddef.h> /* size_t */
void inhibit_uops_cache(size_t);
uop.S
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
To make sure that the code of the routine is actually 32-byte aligned, here is the disassembly:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> mov edx,esi
0x55555555482c <inhibit_uops_cache+12> jmp 0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt> dec rdi
0x555555554831 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5> ret
0x555555554834 <decrement_jmp_tgt+6> nop
0x555555554835 <decrement_jmp_tgt+7> nop
0x555555554836 <decrement_jmp_tgt+8> nop
0x555555554837 <decrement_jmp_tgt+9> nop
0x555555554838 <decrement_jmp_tgt+10> nop
0x555555554839 <decrement_jmp_tgt+11> nop
0x55555555483a <decrement_jmp_tgt+12> nop
0x55555555483b <decrement_jmp_tgt+13> nop
0x55555555483c <decrement_jmp_tgt+14> nop
0x55555555483d <decrement_jmp_tgt+15> nop
0x55555555483e <decrement_jmp_tgt+16> nop
0x55555555483f <decrement_jmp_tgt+17> nop
Running it as
int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
6 431 201 748 idq.dsb_cycles (56,91%)
19 175 741 518 idq.dsb_uops (57,13%)
7 866 687 idq.mite_uops (57,36%)
3 954 421 idq.ms_uops (57,46%)
560 459 dsb2mite_switches.penalty_cycles (57,28%)
884 486 frontend_retired.dsb_miss (57,05%)
6 782 598 787 cycles (56,82%)
1,749000366 seconds time elapsed
1,748985000 seconds user
0,000000000 seconds sys
which is exactly what I expected to get.
The vast majority of uops came from the uops cache. The number of uops also matches my expectation:
mov edx, esi - 1 uop;
jmp imm - 1 uop; near
dec rdi - 1 uop;
ja - 1 uop; near
4096 * 4096 * 128 * 9 = 19 327 352 832
which is approximately equal to the counters 19 326 755 442 + 3 836 395 + 1 642 975
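The expected figure can be reproduced with 64-bit arithmetic. This is a minimal sketch; the names UOPS_PER_ITERATION and expected_uops are made up for illustration, and the per-instruction uop counts are the ones listed above.

```c
#include <stdint.h>

/* Per-iteration uop count from the breakdown above:
 * 6 x mov edx, esi + jmp + dec rdi + ja (each counted as 1 uop). */
enum { UOPS_PER_ITERATION = 6 + 1 + 1 + 1 };

/* Total uops issued by the loop body for a given iteration count. */
uint64_t expected_uops(uint64_t iterations)
{
    return iterations * UOPS_PER_ITERATION;
}
```

Calling expected_uops(4096ULL * 4096 * 128) yields 19327352832, the value computed above. The ULL suffix keeps the whole product in 64 bits, since the result does not fit in a 32-bit int.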
Case 2:
Consider the implementation of inhibit_uops_cache which differs only by one instruction being commented out:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
Disassembly:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5> ret
0x555555554832 <decrement_jmp_tgt+6> nop
0x555555554833 <decrement_jmp_tgt+7> nop
0x555555554834 <decrement_jmp_tgt+8> nop
0x555555554835 <decrement_jmp_tgt+9> nop
0x555555554836 <decrement_jmp_tgt+10> nop
0x555555554837 <decrement_jmp_tgt+11> nop
0x555555554838 <decrement_jmp_tgt+12> nop
0x555555554839 <decrement_jmp_tgt+13> nop
0x55555555483a <decrement_jmp_tgt+14> nop
0x55555555483b <decrement_jmp_tgt+15> nop
0x55555555483c <decrement_jmp_tgt+16> nop
0x55555555483d <decrement_jmp_tgt+17> nop
0x55555555483e <decrement_jmp_tgt+18> nop
0x55555555483f <decrement_jmp_tgt+19> nop
Running it as
int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
2 464 970 970 idq.dsb_cycles (56,93%)
6 197 024 207 idq.dsb_uops (57,01%)
10 845 763 859 idq.mite_uops (57,19%)
3 022 089 idq.ms_uops (57,38%)
321 614 dsb2mite_switches.penalty_cycles (57,35%)
1 733 465 236 frontend_retired.dsb_miss (57,16%)
8 405 643 642 cycles (56,97%)
2,117538141 seconds time elapsed
2,117511000 seconds user
0,000000000 seconds sys
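The counters are completely unexpected.
I expected all the uops to come from the DSB as before, since the routine meets the requirements of the uops cache.
By contrast, roughly two thirds of the uops came from the Legacy Decode Pipeline (MITE).
QUESTION: What's wrong with Case 2? Which counters should I look at to understand what's going on?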
UPD: Following @PeterCordes's idea, I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:
Case 3:
Aligning the unconditional jump target to 32 bytes as follows
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache
ret
Disassembly:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt> dec rdi
0x555555554843 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5> ret
and running it as
int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the following counters
Performance counter stats for './bin':
4 296 298 295 idq.dsb_cycles (57,19%)
17 145 751 147 idq.dsb_uops (57,32%)
45 834 799 idq.mite_uops (57,32%)
1 896 769 idq.ms_uops (57,32%)
136 865 dsb2mite_switches.penalty_cycles (57,04%)
161 314 frontend_retired.dsb_miss (56,90%)
4 319 137 397 cycles (56,91%)
1,096792233 seconds time elapsed
1,096759000 seconds user
0,000000000 seconds sys
The result is exactly as expected. More than 99% of the uops came from the DSB.
Avg DSB uops delivery rate = 17 145 751 147 / 4 296 298 295 = 3.99
which is close to the peak bandwidth.
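Both figures above follow directly from the perf counters. The sketch below shows the arithmetic; the function names are hypothetical, and the inputs are the raw idq.* values as printed by perf.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of uops delivered per cycle in which the DSB delivered:
 * idq.dsb_uops / idq.dsb_cycles. */
double dsb_uops_per_cycle(uint64_t dsb_uops, uint64_t dsb_cycles)
{
    assert(dsb_cycles > 0);
    return (double)dsb_uops / (double)dsb_cycles;
}

/* Fraction of all delivered uops that came from the DSB:
 * idq.dsb_uops / (idq.dsb_uops + idq.mite_uops + idq.ms_uops). */
double dsb_coverage(uint64_t dsb_uops, uint64_t mite_uops, uint64_t ms_uops)
{
    uint64_t total = dsb_uops + mite_uops + ms_uops;
    assert(total > 0);
    return (double)dsb_uops / (double)total;
}
```

With the Case 3 counters, dsb_uops_per_cycle(17145751147, 4296298295) comes out just under 4, and dsb_coverage(17145751147, 45834799, 1896769) is above 0.99. Because perf multiplexed the events (each was counted only about 57% of the time), the inputs are scaled estimates, so the results are approximate.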
Recommended answer
This is not the answer to the OP's problem, but it is one to watch out for.
Other observations: the block of 6 mov instructions should fill a uop cache line, with jmp in a line by itself. In case 2, the 5 mov + jmp should fit in one cache line (or more properly "way").
(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30 is not a 32-byte boundary, only 0x...20 and 40, so this erratum shouldn't be the problem for the code in the question.)
A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures (KBL142 on your Kaby Lake specifically).
Microcode Update (MCU) to Mitigate JCC Erratum
This erratum can be prevented by a microcode update (MCU). The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.
Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (The PDF screenshot was borrowed from a Phoronix article with benchmarks before and after the update, and after rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall.)
The last byte of the ja in your code is ...30, so it's the culprit.
If this were a 32-byte boundary, not just a 16-byte one, then we'd have the problem here:
0x55555555482a <inhibit_uops_cache+10> jmp # fine
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5> ret # fine
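The condition from the quoted whitepaper text can be checked mechanically for any jump. The following is a sketch under stated assumptions: jcc_erratum_affected is a hypothetical helper name, and for a macro-fused op+Jcc pair the caller is expected to pass the start address of the op and the combined length of both instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { JCC_BOUNDARY = 32 };

/*
 * Returns true when a jump occupying bytes [start, start + length)
 * either crosses a 32-byte boundary or ends exactly on one, which is
 * the condition the quoted MCU description gives for not caching the
 * jump in the Decoded ICache.
 */
bool jcc_erratum_affected(uint64_t start, unsigned length)
{
    assert(length > 0);
    uint64_t last = start + length - 1;            /* address of the last byte */
    bool crosses = (start / JCC_BOUNDARY) != (last / JCC_BOUNDARY);
    bool ends_on_boundary = ((start + length) % JCC_BOUNDARY) == 0;
    return crosses || ends_on_boundary;
}
```

For the 2-byte ja at 0x55555555482f in the listing above, the check returns false: its bytes sit at ...2f and ...30, inside the single 32-byte block ...20 to ...3f. A 2-byte jump starting at ...3e or ...3f would return true.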
This section is not fully updated; it still talks about spanning a 32B boundary.
JA itself spans a boundary.
Inserting a NOP after dec rdi should work, putting the 2-byte ja fully after the boundary, in a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.
Using sub rdi, 1 to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that fused instruction would still span the boundary.
You could use single-byte nops instead of mov before the jmp to move everything earlier, if that gets it all in before the last byte of a block.
ASLR can change what virtual page code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in disassembly in one case will happen every time.
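The ASLR point comes down to modular arithmetic: moving code by a whole number of 4096-byte pages leaves its offset within a 32-byte block unchanged, because 4096 is a multiple of 32. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>

enum { PAGE_BYTES = 4096, BLOCK_BYTES = 32 };

/* Offset of an address within its aligned 32-byte block. */
unsigned offset_in_32b_block(uint64_t addr)
{
    return (unsigned)(addr % BLOCK_BYTES);
}

/* Address after shifting the mapping by a whole number of pages,
 * modelling a different ASLR placement of the same code. */
uint64_t relocate_by_pages(uint64_t addr, uint64_t pages)
{
    return addr + pages * PAGE_BYTES;
}
```

For example, offset_in_32b_block(0x55555555482f) is 15, and it is still 15 after relocate_by_pages moves the address by any number of pages.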