Does a series of x86 call/ret instructions form a dependent chain?


Question

Consider the following x86-64 assembly:

inner:
   ...
   ret

outer:
.top:
   call inner
   dec  rdi
   jnz  .top
   ret

The function outer simply repeatedly makes a call to the function inner (whose body isn't shown - it may be empty).

Does the series of call instructions in outer, and the corresponding ret instructions inside inner form a dependent chain in practice (for the purposes of estimating performance)?

There is more than one way this chain could be formed. For example, does the ret depend on the latency of the preceding call instruction and then does the subsequent call instruction depend on the ret, forming a call -> ret -> call chain? Or perhaps the ret is independent but the call is not, forming a call -> call chain? If there is a chain, is it through memory, a register, the stack engine, the return address predictor¹, or what?

Motivation: This question originated from a series of comments on another question, mostly this comment and earlier ones.


¹ The terminology might be somewhat unclear here: the stack engine is normally understood to handle transforming rsp-modifying instructions into a single access with an appropriate offset, so that push rax; push rbx might be transformed into something like mov [t0], rax; mov [t0 - 8], rbx where t0 is some temporary register that captured the value of rsp at some point. It is also understood to handle a similar transformation for call and ret instructions, which both modify the stack in a way similar to push and pop, while also including a direct or indirect (respectively) jump. The CPU also includes a mechanism to predict that return indirect jump, which some lump under "stack engine" - but here I'm separating that out into "return address predictor".

Solution

No, branch-prediction + speculative execution break the store/reload dependency.

RIP is (speculatively) known by the front-end, from the return-address predictor. The next call instruction can thus push a return address without waiting for the ret to execute (and actually load and verify the correctness of the predicted return address against the data from the stack).

Speculative stores can enter the store buffer and be store-forwarded.

There is of course a dependency chain, but it's not loop-carried. Out-of-order execution hides it by keeping many iterations in flight.

Proof: call's store breaks what would otherwise be a loop-carried memory dependency chain.

align 64
global _start
_start:
    mov     ebp, 250000000   ; I had been unrolling by 4, should have changed this to 5000... before measuring, but forgot.

align 32
.mainloop:
    call  delay_retaddr
    call  delay_retaddr

    dec ebp
    jg .mainloop

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

    ;; Placing this function *before* _start, or increasing the alignment,
    ;; makes it somewhat slower!
align 32
delay_retaddr:
    add qword [rsp], 0
    add qword [rsp], 0    ; create latency for the ret addr
    ret

Assemble and link with yasm -felf64 -Worphan-labels -gdwarf2 foo.asm && ld -o foo foo.o, producing a static ELF binary.

Profiled (on an i7-6700k) with ocperf.py, I get 0.99 instructions per core clock cycle:

$ taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dsb2mite_switches.penalty_cycles -r2 ./foo

 Performance counter stats for './foo' (2 runs):

        645.770390      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.05% )
                 1      context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0      cpu-migrations            #    0.000 K/sec
                 2      page-faults               #    0.004 K/sec                    ( +- 20.00% )
     2,517,412,984      cycles                    #    3.898 GHz                      ( +-  0.09% )
     1,250,159,413      branches                  # 1935.919 M/sec                    ( +-  0.00% )
     2,500,838,090      instructions              #    0.99  insn per cycle           ( +-  0.00% )
     4,010,093,750      uops_issued_any           # 6209.783 M/sec                    ( +-  0.03% )
     7,010,150,784      uops_executed_thread      # 10855.485 M/sec                   ( +-  0.02% )
            62,838      dsb2mite_switches_penalty_cycles #    0.097 M/sec                    ( +- 30.92% )

       0.645899414 seconds time elapsed                                          ( +-  0.05% )

With the called function before _start, and alignment values of 128, IPC can go down from 0.99 to 0.84, which is super-weird. Counts for dsb2mite switches are still low-ish, so it's mostly still running from the uop cache, not the legacy decoders. (This Skylake CPU has the microcode update that disables the loop buffer, in case that would be relevant with all this jumping.)

To sustain good throughput, the CPU has to keep many iterations of the inner loop in flight because we've significantly lengthened the independent dep chains that need to overlap.
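
As a rough sanity check (my arithmetic, not part of the original answer): 2,517,412,984 cycles / 250,000,000 iterations ≈ 10 cycles per iteration, i.e. about 5 cycles per call/ret pair. But the serial chain through one call body is three store-forwarding hops (call's store to the first add's reload, the first add's store to the second add's reload, and the second add's store to ret's reload) at roughly 4-5 cycles each, plus the ALU ops: well over 10 cycles of latency per call. Several call bodies therefore have to be in flight at once to sustain the measured throughput.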


Changing the add [rsp], 0 instructions to [rsp+16] creates a loop-carried dependency chain on a different location, which isn't being stored to by call. So the loop bottlenecks on that store-forwarding latency and runs at ~half speed.
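
For reference, the modified function body, reconstructed from the description above (everything else stays the same as the listing earlier):

align 32
delay_retaddr:
    add qword [rsp+16], 0   ; 16 bytes above the return-address slot: call never
    add qword [rsp+16], 0   ; stores here, so the RMW chain on this location is
    ret                     ; carried from one invocation to the next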

# With  add qword [rsp+16], 0

 Performance counter stats for './foo' (2 runs):

   1212.339007      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.04% )
             2      context-switches          #    0.002 K/sec                    ( +- 60.00% )
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.002 K/sec                  
 4,727,361,809      cycles                    #    3.899 GHz                      ( +-  0.02% )
 1,250,292,058      branches                  # 1031.306 M/sec                    ( +-  0.00% )
 2,501,537,152      instructions              #    0.53  insn per cycle           ( +-  0.00% )
 4,026,138,227      uops_issued_any           # 3320.967 M/sec                    ( +-  0.02% )
 7,026,457,222      uops_executed_thread      # 5795.786 M/sec                    ( +-  0.01% )
       230,287      dsb2mite_switches_penalty_cycles #    0.190 M/sec                    ( +- 68.23% )

   1.212612110 seconds time elapsed                                          ( +-  0.04% )

Note that I'm still using an RSP-relative address so there's still a stack-sync uop. I could have kept both cases the same and avoided it in both by using an address relative to a different register (e.g. rbp) to address the location where call/ret store/reload the return address.
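
A sketch of that variant (my reconstruction; not something the answer measured). I've used rbx instead of rbp because the proof code above already uses ebp as the loop counter: set the register up once, before the loop, to point at the slot where call will push the return address, so nothing inside the function references rsp explicitly:

_start:
    lea     rbx, [rsp-8]    ; where call (below) will store the return address;
                            ; valid because rsp itself never changes in the loop
    mov     ebp, 250000000
    ; ... same .mainloop as above ...

align 32
delay_retaddr:
    add qword [rbx], 0      ; the same slot [rsp] names inside the function, but
    add qword [rbx], 0      ; with no explicit rsp reference: no stack-sync uop
    ret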

I don't think the variable latency of store-forwarding (worse in simple cases where the reload executes right away, back-to-back) is sufficient to explain the difference. See also: Adding a redundant assignment speeds up code when compiled without optimization. This is a factor of 2 speedup from breaking the dependency. (0.99 IPC vs. 0.53 IPC, with the same instructions, just a different addressing mode.)

The instructions are 1 byte longer with the disp8 in the addressing mode, and there was front-end weirdness with alignment in the faster version, but moving things around doesn't seem to change anything with the [rsp+16] version.


Using a version that creates a store-forwarding stall (with add dword [rsp], 0) makes the dep chain too long for OoO exec to hide easily: ret's qword reload is only partially covered by the dword store, so store-forwarding fails and the reload is much slower. I didn't play around with this a huge amount.
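
For completeness, a sketch of that stall variant (again my reconstruction, not measured in the answer):

align 32
delay_retaddr:
    add dword [rsp], 0   ; dword RMW of the low half of the return address;
    ret                  ; ret's qword reload then hits the slow path because
                         ; the narrower store can't be forwarded to it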

