32-byte aligned routine does not fit the uops cache


Problem description


KbL i7-8550U

I'm researching the behavior of uops-cache and came across a misunderstanding regarding it.

As specified in the Intel Optimization Manual, section 2.5.2.2 (emphasis mine):

The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.

All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.

Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.

A non-conditional branch is the last micro-op in a Way.


CASE 1:

Consider the following routine:

uop.h

void inhibit_uops_cache(size_t);

uop.S

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    jmp decrement_jmp_tgt
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
    ret
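
The listings use NASM syntax; assuming uop.S also contains a section .text and a global inhibit_uops_cache directive (not shown above), building could look like this (my guess at the commands, not from the original post):

nasm -felf64 uop.S -o uop.o
gcc main.c uop.o -o bin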

To make sure that the code of the routine is actually 32-byte aligned, here is the disassembly:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  mov    edx,esi
0x55555555482c <inhibit_uops_cache+12>  jmp    0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt>      dec    rdi
0x555555554831 <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5>    ret
0x555555554834 <decrement_jmp_tgt+6>    nop
0x555555554835 <decrement_jmp_tgt+7>    nop
0x555555554836 <decrement_jmp_tgt+8>    nop
0x555555554837 <decrement_jmp_tgt+9>    nop
0x555555554838 <decrement_jmp_tgt+10>   nop
0x555555554839 <decrement_jmp_tgt+11>   nop
0x55555555483a <decrement_jmp_tgt+12>   nop
0x55555555483b <decrement_jmp_tgt+13>   nop
0x55555555483c <decrement_jmp_tgt+14>   nop
0x55555555483d <decrement_jmp_tgt+15>   nop
0x55555555483e <decrement_jmp_tgt+16>   nop
0x55555555483f <decrement_jmp_tgt+17>   nop             

running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}
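
The counter listings below look like perf output; an invocation along these lines would collect them (the exact command is my assumption; the event names are taken from the output itself):

perf stat -e idq.dsb_cycles,idq.dsb_uops,idq.mite_uops,idq.ms_uops,dsb2mite_switches.penalty_cycles,frontend_retired.dsb_miss,cycles ./bin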

I got the counters

 Performance counter stats for './bin':

     6 431 201 748      idq.dsb_cycles                                                (56,91%)
    19 175 741 518      idq.dsb_uops                                                  (57,13%)
         7 866 687      idq.mite_uops                                                 (57,36%)
         3 954 421      idq.ms_uops                                                   (57,46%)
           560 459      dsb2mite_switches.penalty_cycles                                     (57,28%)
           884 486      frontend_retired.dsb_miss                                     (57,05%)
     6 782 598 787      cycles                                                        (56,82%)

       1,749000366 seconds time elapsed

       1,748985000 seconds user
       0,000000000 seconds sys

This is exactly what I expected to get.

The vast majority of uops came from the uops cache. The uop counts also match my expectation perfectly:

mov edx, esi - 1 uop;
jmp imm      - 1 uop; near 
dec rdi      - 1 uop;
ja           - 1 uop; near

4096 * 4096 * 128 * 9 = 19 327 352 832, approximately equal to the sum of the counters 19 326 755 442 + 3 836 395 + 1 642 975


CASE 2:

Consider an implementation of inhibit_uops_cache that differs by one commented-out instruction:

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    ; mov edx, esi
    jmp decrement_jmp_tgt
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
    ret

disas:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  jmp    0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt>      dec    rdi
0x55555555482f <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5>    ret
0x555555554832 <decrement_jmp_tgt+6>    nop
0x555555554833 <decrement_jmp_tgt+7>    nop
0x555555554834 <decrement_jmp_tgt+8>    nop
0x555555554835 <decrement_jmp_tgt+9>    nop
0x555555554836 <decrement_jmp_tgt+10>   nop
0x555555554837 <decrement_jmp_tgt+11>   nop
0x555555554838 <decrement_jmp_tgt+12>   nop
0x555555554839 <decrement_jmp_tgt+13>   nop
0x55555555483a <decrement_jmp_tgt+14>   nop
0x55555555483b <decrement_jmp_tgt+15>   nop
0x55555555483c <decrement_jmp_tgt+16>   nop
0x55555555483d <decrement_jmp_tgt+17>   nop
0x55555555483e <decrement_jmp_tgt+18>   nop
0x55555555483f <decrement_jmp_tgt+19>   nop                      

running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}

I got the counters

 Performance counter stats for './bin':

     2 464 970 970      idq.dsb_cycles                                                (56,93%)
     6 197 024 207      idq.dsb_uops                                                  (57,01%)
    10 845 763 859      idq.mite_uops                                                 (57,19%)
         3 022 089      idq.ms_uops                                                   (57,38%)
           321 614      dsb2mite_switches.penalty_cycles                                     (57,35%)
     1 733 465 236      frontend_retired.dsb_miss                                     (57,16%)
     8 405 643 642      cycles                                                        (56,97%)

       2,117538141 seconds time elapsed

       2,117511000 seconds user
       0,000000000 seconds sys

The counters are completely unexpected.

I expected all the uops to come from the DSB as before, since the routine matches the requirements of the uops cache.

Instead, almost 70% of the uops came from the Legacy Decode Pipeline (MITE).

QUESTION: What's wrong in CASE 2? Which counters should I look at to understand what's going on?


UPD: Following @PeterCordes' idea, I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:

CASE 3:

Aligning the unconditional jump target to 32 bytes as follows:

align 32
inhibit_uops_cache:
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    mov edx, esi
    ; mov edx, esi
    jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
    dec rdi
    ja inhibit_uops_cache
    ret

disas:

0x555555554820 <inhibit_uops_cache>     mov    edx,esi
0x555555554822 <inhibit_uops_cache+2>   mov    edx,esi
0x555555554824 <inhibit_uops_cache+4>   mov    edx,esi
0x555555554826 <inhibit_uops_cache+6>   mov    edx,esi
0x555555554828 <inhibit_uops_cache+8>   mov    edx,esi
0x55555555482a <inhibit_uops_cache+10>  jmp    0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt>      dec    rdi
0x555555554843 <decrement_jmp_tgt+3>    ja     0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5>    ret                                              

and running as

int main(void){
    inhibit_uops_cache(4096 * 4096 * 128L);
}

I got the following counters

 Performance counter stats for './bin':

     4 296 298 295      idq.dsb_cycles                                                (57,19%)
    17 145 751 147      idq.dsb_uops                                                  (57,32%)
        45 834 799      idq.mite_uops                                                 (57,32%)
         1 896 769      idq.ms_uops                                                   (57,32%)
           136 865      dsb2mite_switches.penalty_cycles                                     (57,04%)
           161 314      frontend_retired.dsb_miss                                     (56,90%)
     4 319 137 397      cycles                                                        (56,91%)

       1,096792233 seconds time elapsed

       1,096759000 seconds user
       0,000000000 seconds sys

The result is perfectly expected. More than 99% of the uops came from the DSB.

Avg dsb uops delivery rate = 17 145 751 147 / 4 296 298 295 = 3.99

which is close to the peak bandwidth.

Solution

This is not the answer to the OP's problem, but it is one to watch out for.

See Code alignment dramatically affects performance for compiler options to work around this performance pothole that Intel introduced into Skylake-derived CPUs as part of the workaround described below.


Other observations: the block of 6 mov instructions should fill one uop cache line, with the jmp in a line by itself. In CASE 2, the 5 mov + jmp should fit in one cache line (or more properly, "way").
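
Here is how I would expect the packing to work out under the manual's rules quoted above (my reading of the rules, not a dump of the actual DSB contents):

CASE 1:  way 1: mov mov mov mov mov mov   ; way full at six uops
         way 2: jmp                       ; unconditional branch is last in a way
         way 3: dec ja ret                ; three ways: the per-region maximum
CASE 2:  way 1: mov mov mov mov mov jmp   ; six uops, jmp last -- legal
         way 2: dec ja ret                ; only two ways needed

So by these rules CASE 2 should fit even more comfortably than CASE 1, which is exactly why its counters are surprising.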

(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30 is not a 32-byte boundary, only 0x...20 and 40, so this erratum shouldn't be the problem for the code in the question.)


A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures. (KBL142 on your Kaby-Lake specifically).

Microcode Update (MCU) to Mitigate JCC Erratum

This erratum can be prevented by a microcode update (MCU). The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.
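
The boundary condition from the quote is easy to check mechanically. Here is a minimal C sketch (my own helper, not from Intel's whitepaper) that tests whether a jump's bytes cross or end on a 32-byte boundary:

#include <stdint.h>
#include <stdio.h>

/* 1 if a jump occupying [addr, addr+len-1] crosses a 32-byte boundary
   or its last byte is the last byte of a 32-byte chunk. */
static int jcc_mcu_affected(uint64_t addr, unsigned len)
{
    uint64_t last = addr + len - 1;       /* address of the last byte */
    return (addr >> 5) != (last >> 5)     /* crosses a 32B boundary */
        || (last & 31) == 31;             /* ends on a 32B boundary */
}

int main(void)
{
    /* The 2-byte ja from CASE 2 at 0x...2f: crosses only a 16B boundary. */
    printf("%d\n", jcc_mcu_affected(0x55555555482fULL, 2)); /* 0: unaffected */
    /* A hypothetical 2-byte jcc at 0x...3f would cross into 0x...40. */
    printf("%d\n", jcc_mcu_affected(0x55555555483fULL, 2)); /* 1: affected */
    return 0;
}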

Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (PDF screenshot borrowed from a Phoronix article with benchmarks before/after, and after with rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall).


The last byte of the ja in your code is ...30, so it's the culprit.

If this was a 32-byte boundary, not just 16, then we'd have the problem here:

0x55555555482a <inhibit_uops_cache+10>  jmp         # fine
0x55555555482c <decrement_jmp_tgt>      dec    rdi
0x55555555482f <decrement_jmp_tgt+3>    ja          # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5>    ret         # fine

(This section is not fully updated; it still talks about spanning a 32B boundary.)

JA itself spans a boundary.

Inserting a NOP after dec rdi should work, putting the 2-byte ja fully after the boundary with a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.
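
Applied to the hypothetical layout above (same illustrative addresses, still pretending 0x...30 is a 32-byte boundary), the NOP fix would look like:

0x55555555482c <decrement_jmp_tgt>      dec    rdi
0x55555555482f <decrement_jmp_tgt+3>    nop         # 1-byte padding
0x555555554830 <decrement_jmp_tgt+4>    ja          # fully inside the next 32B chunk
0x555555554832 <decrement_jmp_tgt+6>    ret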

Using sub rdi, 1 to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that instruction would still span the boundary.

You could use single-byte nops instead of mov before the jmp to move everything earlier, if that gets it all in before the last byte of a block.


ASLR can change what virtual page code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in disassembly in one case will happen every time.
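
A quick numeric illustration of that point (the second address is made up): two different ASLR placements share bits 0-11, so their alignment relative to 16/32/64-byte boundaries is identical on every run.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t run_a = 0x555555554831ULL;  /* placement seen under gdb above */
    uint64_t run_b = 0x7f3b2a191831ULL;  /* hypothetical placement in another run */
    /* ASLR randomizes bits 12 and up; the page offset is fixed at link time. */
    printf("%#llx %#llx\n",
           (unsigned long long)(run_a & 0xfff),
           (unsigned long long)(run_b & 0xfff)); /* both print 0x831 */
    return 0;
}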
