32-byte aligned routine does not fit the uops cache


Question

KbL i7-8550U

I'm researching the behavior of the uops cache and ran into something I don't understand.

As specified in the Intel Optimization Manual 2.5.2.2 (emphasis mine):

The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.

-

All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.

-

Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.

-

A non-conditional branch is the last micro-op in a Way.


CASE 1:

Consider the following routine:

uop.h

void inhibit_uops_cache(size_t);

uop.S

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    jmp decrement_jmp_tgt
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
    ret

To make sure that the code of the routine is actually 32-byte aligned, here is the disassembly:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  mov    edx,esi
0x55555555482c <inhibit_uops_cache+12>  jmp    0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt>      dec    rdi
0x555555554831 <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5>    ret
0x555555554834 <decrement_jmp_tgt+6>    nop
0x555555554835 <decrement_jmp_tgt+7>    nop
0x555555554836 <decrement_jmp_tgt+8>    nop
0x555555554837 <decrement_jmp_tgt+9>    nop
0x555555554838 <decrement_jmp_tgt+10>   nop
0x555555554839 <decrement_jmp_tgt+11>   nop
0x55555555483a <decrement_jmp_tgt+12>   nop
0x55555555483b <decrement_jmp_tgt+13>   nop
0x55555555483c <decrement_jmp_tgt+14>   nop
0x55555555483d <decrement_jmp_tgt+15>   nop
0x55555555483e <decrement_jmp_tgt+16>   nop
0x55555555483f <decrement_jmp_tgt+17>   nop             

Running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}

I got the counters

 Performance counter stats for './bin':

     6 431 201 748      idq.dsb_cycles                                                (56,91%)
    19 175 741 518      idq.dsb_uops                                                  (57,13%)
         7 866 687      idq.mite_uops                                                 (57,36%)
         3 954 421      idq.ms_uops                                                   (57,46%)
           560 459      dsb2mite_switches.penalty_cycles                                     (57,28%)
           884 486      frontend_retired.dsb_miss                                     (57,05%)
     6 782 598 787      cycles                                                        (56,82%)

       1,749000366 seconds time elapsed

       1,748985000 seconds user
       0,000000000 seconds sys

which is exactly what I expected to get.

The vast majority of uops came from the uops cache. The uop count also matches my expectation:

mov edx, esi - 1 uop;
jmp imm      - 1 uop; near 
dec rdi      - 1 uop;
ja           - 1 uop; near

4096 * 4096 * 128 * 9 = 19 327 352 832, which is approximately equal to the counters 19 326 755 442 + 3 836 395 + 1 642 975
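
The same check can be repeated against the perf output listed for CASE 1. This is a minimal sketch with illustrative variable names. It assumes the three idq.*_uops counters can simply be added, and the counters were multiplexed (note the percentages in parentheses), so the measured total is only an estimate.

```python
# Expected uops per loop iteration in CASE 1: 6 x mov, jmp, dec, ja (1 uop each).
UOPS_PER_ITERATION = 6 + 1 + 1 + 1
ITERATIONS = 4096 * 4096 * 128

expected_uops = UOPS_PER_ITERATION * ITERATIONS

# Counters copied from the CASE 1 perf output above (scaled estimates).
measured = {
    "idq.dsb_uops": 19_175_741_518,
    "idq.mite_uops": 7_866_687,
    "idq.ms_uops": 3_954_421,
}
measured_total = sum(measured.values())

print(f"expected: {expected_uops:,}")
print(f"measured: {measured_total:,}")
print(f"measured / expected: {measured_total / expected_uops:.4f}")
```

A ratio close to 1 supports the assumption of one uop per instruction in the loop body.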

CASE 2:

Consider an implementation of inhibit_uops_cache that differs by one instruction, which is commented out:

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    ; mov edx, esi
    jmp decrement_jmp_tgt
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
    ret

Disassembly:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  jmp    0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt>      dec    rdi
0x55555555482f <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5>    ret
0x555555554832 <decrement_jmp_tgt+6>    nop
0x555555554833 <decrement_jmp_tgt+7>    nop
0x555555554834 <decrement_jmp_tgt+8>    nop
0x555555554835 <decrement_jmp_tgt+9>    nop
0x555555554836 <decrement_jmp_tgt+10>   nop
0x555555554837 <decrement_jmp_tgt+11>   nop
0x555555554838 <decrement_jmp_tgt+12>   nop
0x555555554839 <decrement_jmp_tgt+13>   nop
0x55555555483a <decrement_jmp_tgt+14>   nop
0x55555555483b <decrement_jmp_tgt+15>   nop
0x55555555483c <decrement_jmp_tgt+16>   nop
0x55555555483d <decrement_jmp_tgt+17>   nop
0x55555555483e <decrement_jmp_tgt+18>   nop
0x55555555483f <decrement_jmp_tgt+19>   nop                      

Running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}

I got the counters

 Performance counter stats for './bin':

     2 464 970 970      idq.dsb_cycles                                                (56,93%)
     6 197 024 207      idq.dsb_uops                                                  (57,01%)
    10 845 763 859      idq.mite_uops                                                 (57,19%)
         3 022 089      idq.ms_uops                                                   (57,38%)
           321 614      dsb2mite_switches.penalty_cycles                                     (57,35%)
     1 733 465 236      frontend_retired.dsb_miss                                     (57,16%)
     8 405 643 642      cycles                                                        (56,97%)

       2,117538141 seconds time elapsed

       2,117511000 seconds user
       0,000000000 seconds sys

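The counters are completely unexpected.

I expected all the uops to come from the DSB as before, since the routine matches the requirements of the uops cache.

By contrast, about 64% of the uops (idq.mite_uops as a share of the three idq.*_uops counters above) came from the Legacy Decode Pipeline.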
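
The share of uops per front-end source can be computed directly from the CASE 2 counters. This is a rough sketch: the helper name `source_shares` is made up for illustration, and it treats the three idq.*_uops counters as disjoint, which is only an approximation.

```python
# Counters copied from the CASE 2 perf output above (multiplexed, so estimates).
case2 = {
    "idq.dsb_uops": 6_197_024_207,
    "idq.mite_uops": 10_845_763_859,
    "idq.ms_uops": 3_022_089,
}


def source_shares(counters):
    """Return each counter's fraction of the summed uop counters."""
    total = sum(counters.values())
    return {name: value / total for name, value in counters.items()}


for name, share in source_shares(case2).items():
    print(f"{name:15s} {share:6.1%}")
```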

QUESTION: What's wrong with CASE 2? Which counters should I look at to understand what's going on?

UPD: Following @PeterCordes's idea, I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:

CASE 3:

Aligning the unconditional jump target to 32 bytes as follows

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    ; mov edx, esi
    jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache
    ret

Disassembly:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  jmp    0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt>      dec    rdi
0x555555554843 <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5>    ret                                              

and running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}

I got the following counters

 Performance counter stats for './bin':

     4 296 298 295      idq.dsb_cycles                                                (57,19%)
    17 145 751 147      idq.dsb_uops                                                  (57,32%)
        45 834 799      idq.mite_uops                                                 (57,32%)
         1 896 769      idq.ms_uops                                                   (57,32%)
           136 865      dsb2mite_switches.penalty_cycles                                     (57,04%)
           161 314      frontend_retired.dsb_miss                                     (56,90%)
     4 319 137 397      cycles                                                        (56,91%)

       1,096792233 seconds time elapsed

       1,096759000 seconds user
       0,000000000 seconds sys

The result is perfectly expected. More than 99% of the uops came from the DSB.

Avg DSB uops delivery rate = 17 145 751 147 / 4 296 298 295 = 3.99

which is close to the peak bandwidth.
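
Another way to compare the three cases is cycles per loop iteration, using the `cycles` counter from each perf output. A minimal sketch; the helper name is illustrative, and the counters are multiplexed estimates that also include a small amount of startup work outside the loop.

```python
ITERATIONS = 4096 * 4096 * 128

# "cycles" counters copied from the three perf outputs above.
cycles = {
    "CASE 1": 6_782_598_787,
    "CASE 2": 8_405_643_642,
    "CASE 3": 4_319_137_397,
}


def cycles_per_iteration(total_cycles, iterations=ITERATIONS):
    """Average number of core cycles spent per loop iteration."""
    return total_cycles / iterations


for case, total in cycles.items():
    print(f"{case}: {cycles_per_iteration(total):.2f} cycles/iteration")
```

On these numbers CASE 3 needs roughly half the cycles per iteration of CASE 2, even though both execute the same eight instructions per iteration.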

Answer

This is not the answer to the OP's problem, but it is one to watch out for.

Other observations: the block of 6 mov instructions should fill a uop cache line, with jmp in a line by itself. In case 2, the 5 mov + jmp should fit in one cache line (or more properly "way").
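
The packing described here can be sketched with a toy model of the manual's rules quoted in the question (at most 6 uops per way, at most 3 ways per 32-byte region, an unconditional branch ends its way). This is an illustration, not a description of the real hardware: the function name and tuple layout are invented, ret is left out, one uop per instruction is assumed, and any additional constraints the hardware applies (for example at branch targets) are ignored.

```python
MAX_UOPS_PER_WAY = 6
MAX_WAYS_PER_REGION = 3


def pack_ways(instructions):
    """Greedily pack (name, uops, is_unconditional_branch) tuples into ways."""
    ways, current, used = [], [], 0
    for name, uops, is_uncond_branch in instructions:
        if used + uops > MAX_UOPS_PER_WAY:
            # The current way is full: close it and start a new one.
            ways.append(current)
            current, used = [], 0
        current.append(name)
        used += uops
        if is_uncond_branch:
            # An unconditional branch must be the last uop in its way.
            ways.append(current)
            current, used = [], 0
    if current:
        ways.append(current)
    return ways


tail = [("jmp", 1, True), ("dec", 1, False), ("ja", 1, False)]
case1 = [("mov", 1, False)] * 6 + tail
case2 = [("mov", 1, False)] * 5 + tail

for label, body in (("CASE 1", case1), ("CASE 2", case2)):
    ways = pack_ways(body)
    fits = len(ways) <= MAX_WAYS_PER_REGION
    print(label, ways, "fits" if fits else "does not fit")
```

Under this model both loop bodies stay within three ways, which is why the CASE 2 counters are surprising.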

(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30 is not a 32-byte boundary, only 0x...20 and 0x...40 are, so this erratum shouldn't be the problem for the code in the question.)

A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures (KBL142 on your Kaby Lake specifically).

Microcode Update (MCU) to Mitigate JCC Erratum

This erratum can be prevented by a microcode update (MCU). The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.

Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (The PDF screenshot was borrowed from a Phoronix article with benchmarks before and after the update, and after rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall.)

The last byte of the ja in your code is at ...30, which at first glance made it look like the culprit.

If this were a 32-byte boundary, not just a 16-byte one, then we'd have the problem here:

0x55555555482a <inhibit_uops_cache+10>  jmp         # fine
0x55555555482c <decrement_jmp_tgt>      dec    rdi
0x55555555482f <decrement_jmp_tgt+3>    ja          # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5>    ret         # fine
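
The boundary condition from the quoted MCU description can be checked mechanically for the branch addresses in the CASE 2 disassembly. A minimal sketch; the function names are made up, and the instruction lengths (2, 2 and 1 bytes) are read off the address differences in the listing.

```python
def crosses_boundary(start, length, boundary):
    """True if the instruction's first and last byte lie in different blocks."""
    last = start + length - 1
    return start // boundary != last // boundary


def ends_on_boundary(start, length, boundary):
    """True if the instruction's last byte is the last byte of a block."""
    return (start + length) % boundary == 0


def jcc_erratum_affected(start, length):
    """Condition quoted above: crosses or ends on a 32-byte boundary."""
    return crosses_boundary(start, length, 32) or ends_on_boundary(start, length, 32)


# (address, length in bytes) of the branch instructions in CASE 2.
branches = {
    "jmp": (0x55555555482A, 2),
    "ja": (0x55555555482F, 2),
    "ret": (0x555555554831, 1),
}

for name, (addr, length) in branches.items():
    print(
        f"{name:4s} affected by 32B rule: {jcc_erratum_affected(addr, length)}, "
        f"crosses 16B: {crosses_boundary(addr, length, 16)}"
    )
```

Only the 16-byte check fires (for ja), which matches the annotations in the listing.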

(This section is not fully updated and still talks about spanning a 32B boundary.)

JA itself spans a boundary.

Inserting a NOP after dec rdi should work, putting the 2-byte ja fully after the boundary, in a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.

Using sub rdi, 1 to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that fused uop would still span the boundary.

You could use single-byte nops instead of mov before the jmp to move everything earlier, if that gets it all in before the last byte of a block.

ASLR can change which virtual page the code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in the disassembly in one run will happen every time.
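
The ASLR point follows from the page size being a multiple of 32. A small sketch, assuming 4 KiB pages; the page offset 0x82F is that of the ja in CASE 2, while the second and third page numbers are arbitrary values chosen for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def offset_in_32b_block(address):
    """Position of an address inside its aligned 32-byte block."""
    return address % 32


page_offset = 0x82F  # page offset of the ja in the CASE 2 disassembly
page_numbers = (0x555555554, 0x7F1234567, 0x1)  # last two are arbitrary

offsets = []
for page_number in page_numbers:
    address = page_number * PAGE_SIZE + page_offset
    offsets.append(offset_in_32b_block(address))
    print(hex(address), offset_in_32b_block(address))
```

Every address lands at the same position within its 32-byte block, whatever page the loader picks.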
