Why dependency in a loop iteration can't be executed together with the previous one


Problem description

I use this code to test the impact of dependency in a loop iteration on IvyBridge:

global _start
_start:
    mov rcx,    1000000000
.for_loop:          
    inc rax     ; uop A
    inc rax     ; uop B
    dec rcx     ; uop C
    jnz .for_loop   

    xor rdi,    rdi
    mov rax,    60  ; _exit(0)
    syscall

Since dec and jnz will be macro-fused into a single uop, there are 3 uops in my loop; they are labeled in the comments.

uop B depends on uop A, so I think the execution would be like this:

A C
B A C  ; the previous B and current A can be in the same cycle
B A C
...
B A C
B

Therefore the loop should be able to execute at 1 cycle per iteration.

However, the perf tool shows:

 2,009,704,779      cycles                
 1,008,054,984      stalled-cycles-frontend   #   50.16% frontend cycles idle

So it's 2 cycles per iteration, and 50% of the front-end cycles are idle.

What causes the front-end to be 50% idle? Why can't the hypothetical execution diagram be realized?

Solution

B and A form a loop-carried dependency chain. A in the next iteration can't run until it has the result of B in the previous.

Any given B can never run in the same cycle as an A: what input would the later one use, if the earlier one hasn't produced a result yet?

This chain is 2 cycles long (per iteration), because the latency of inc is 1 cycle. This creates a latency bottleneck in the back-end that out-of-order execution can't hide. (Except for very low iteration counts where it can overlap it with code after the loop).
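
Concretely, the steady-state schedule looks something like this (a sketch in the same notation as the question; the independent C can execute in either cycle of an iteration):

A C
B
A C
B
...
B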

Just like if you fully unrolled a huge chain of times 102400 inc eax, there's no instruction-level parallelism for the CPU to find between a chain of instructions that each depend on the previous.
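
For illustration, a hypothetical NASM sketch (not from the original answer; register choices are arbitrary) contrasting one long serial chain with two interleaved, independent chains:

; one serial chain: every inc depends on the previous one,
; so it runs at ~1 inc per cycle (latency-bound, no ILP to find)
times 102400 inc eax

; two independent chains interleaved: same total uop count, but
; out-of-order execution can run one inc from each chain per cycle,
; finishing in roughly half the cycles (throughput-bound instead)
%rep 51200
    inc eax
    inc edx
%endrep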

The macro-fused dec rcx/jnz uop is independent of the RAX chain, and is a shorter chain (only 1 cycle per iteration, being only 1 dec&branch uop with 1c latency). So it can run in parallel with B or A uops.


See my answer on another question for more about the concept of instruction-level parallelism and dependency chains, and how CPUs exploit that parallelism to run instructions in parallel when they're independent.

Agner Fog's microarch PDF shows this with examples in an early chapter: Chapter 2: Out-of-order execution (All processors except P1, PMMX).


If you started a new 2-cycle dep chain every iteration, it would run as you expect. A new chain forking off every iteration would expose instruction-level parallelism for the CPU to keep A and B from different iterations in flight at the same time.

.for_loop:
    xor eax,eax          ; dependency-breaking for RAX
    inc rax     ; uop A
    inc rax     ; uop B
    dec rcx     ; uop C
    jnz .for_loop   

Sandybridge-family handles xor-zeroing without an execution unit, so this is still only 3 unfused-domain uops in the loop, so IvyBridge has enough ALU execution ports to run all 3 in a single cycle. This also maxes out the front-end at 4 fused-domain uops per clock.
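
With the dependency broken every iteration, the schedule the question hoped for becomes achievable (same notation as above; the xor-zeroing uop is eliminated at rename, so it doesn't occupy an execution port):

A C
B A C   ; B from the previous iteration, A and C from the current one
B A C
...
B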

Or if you changed A to start a new dep chain in RAX with any instruction that unconditionally overwrites RAX without depending on the result of the inc, you'd be fine.

    lea  rax, [rdx + rdx]     ; no dependency on B from last iter
    inc  rax                  ; uop B

Except for a couple instructions with an unfortunate output dependency: Why does breaking the "output dependency" of LZCNT matter?

    popcnt rax, rdx        ; false dependency on RAX, 3 cycle latency
    inc  rax               ; uop B

On Intel CPUs, only popcnt and lzcnt/tzcnt have an output dependency for no reason. It's because they use the same execution unit as bsf/bsr, which leave the destination unmodified if the input is zero, on Intel and AMD CPUs. Intel still only documents it on paper as undefined if the input to BSF/BSR is zero, but they build hardware that implements stronger guarantees. (AMD does document this BSF/BSR behaviour.) Anyway, Intel's BSF/BSR are like CMOV, and need the destination as an input in case the source reg is 0. popcnt (and lzcnt/tzcnt on pre-Skylake) suffers from this, too.
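
A common workaround (a sketch, not part of the original answer, using the same registers as the snippet above) is to xor-zero the destination first, which breaks the false output dependency without needing an execution unit:

    xor    eax, eax        ; zeroing idiom, handled at rename; breaks the false dep on old RAX
    popcnt rax, rdx        ; no longer has to wait for the previous value of RAX
    inc    rax             ; uop B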


If you made the loop more than 4 fused-domain uops, SnB/IvB could issue it at best 1 per 2 cycles from the front-end. Haswell and later "unroll" in the loop buffer or something, so a 5 uop loop can run at ~1.25 c per iteration, but SnB/IvB don't. Is performance reduced when executing loops whose uop count is not a multiple of processor width?

The front-end issue/rename stage is 4 fused-domain uops wide in Intel CPUs since Core 2.
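
For example, a hypothetical 5-fused-domain-uop variant of the dependency-breaking loop above (one extra independent ALU uop; not from the original answer) would be limited by the SnB/IvB front-end to about one iteration per 2 cycles, even though the back-end has enough ports and ILP to go faster (Haswell and later could approach ~1.25 c per iteration):

.for_loop:
    xor eax, eax        ; dependency-breaking for RAX, eliminated at rename
    inc rax             ; uop A
    inc rax             ; uop B
    inc rdx             ; extra independent ALU uop
    dec rcx             ; macro-fuses with jnz: 5 fused-domain uops total
    jnz .for_loop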
