What is a Partial Flag Stall?

Question

I was just reviewing this answer by Peter Cordes, in which he says:

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event, other than "sometimes, when flags are read"? What does it mean to merge flags? Under what conditions are "some of the flags written" without a partial-flag merge happening? What do I need to know about flag stalls to understand them?

Recommended answer

Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.

So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but it will cause a stall if a subsequent instruction reads the flag (CF) that inc did not set, with no intervening instruction that sets CF. This also implies that instructions which write all of the relevant flags are never involved in partial stalls: when such an instruction is the most recent flag-setting instruction at the point a flag-reading instruction executes, it must have written the consumed flag.

So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family, cmovcc, and a few specialized instructions like adc), then walk backwards to find the first instruction that sets any flag, and check whether it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
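The backward walk described above can be sketched in a few lines of Python. This is only an illustrative model, not a tool from the answer: the names WRITES, READS and partial_flag_stalls are hypothetical, the tables cover just a handful of mnemonics, any mnemonic missing from WRITES is assumed not to touch the flags, and shifts and rotates (which behave differently) are deliberately left out.

```python
# Simplified model: a program is a list of mnemonics; operands are ignored.
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by each instruction (assumption: anything not listed writes none).
WRITES = {
    "add": ALL_FLAGS, "sub": ALL_FLAGS, "cmp": ALL_FLAGS, "adc": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},   # inc/dec leave CF untouched
    "dec": ALL_FLAGS - {"CF"},
}

# Flags read by each flag-consuming instruction.
READS = {
    "jc": {"CF"}, "jnz": {"ZF"}, "adc": {"CF"},
    "ja": {"CF", "ZF"}, "jbe": {"CF", "ZF"},
}

def partial_flag_stalls(program):
    """Return the indices of flag readers that would stall under the simple rule."""
    stalls = []
    for index, mnemonic in enumerate(program):
        needed = READS.get(mnemonic)
        if not needed:
            continue
        # Walk backwards to the most recent instruction that writes any flag.
        for previous in reversed(program[:index]):
            written = WRITES.get(previous, frozenset())
            if written:
                if not needed <= written:
                    stalls.append(index)
                break
    return stalls

print(partial_flag_stalls(["add", "inc", "ja"]))   # reader at index 2 is flagged
print(partial_flag_stalls(["inc", "add", "ja"]))   # nothing is flagged
```

If no flag writer is found before a reader, the sketch reports nothing, since the snippet alone cannot say what set the flags earlier.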

Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but in some cases they still pay a penalty in the form of an additional uop that the instruction adds to the front-end. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag merging uop is added only when a flag-consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
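The merge-uop rule just stated (several flags read, last set by different instructions) can be modeled by tracking the last writer of each flag. The following is a minimal sketch under stated assumptions: the names WRITES, READS and merge_uop_sites are hypothetical, only a few mnemonics are listed, unlisted mnemonics are assumed not to write flags, and a flag that nothing in the snippet has written is treated as coming from some earlier, distinct instruction.

```python
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by each instruction (assumption: anything not listed writes none).
WRITES = {
    "add": ALL_FLAGS, "sub": ALL_FLAGS, "cmp": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},
    "dec": ALL_FLAGS - {"CF"},
}

# Flags read by each flag-consuming instruction.
READS = {
    "jc": {"CF"}, "jnz": {"ZF"},
    "ja": {"CF", "ZF"}, "jbe": {"CF", "ZF"},
}

def merge_uop_sites(program):
    """Return indices of readers whose flags were last set by different instructions."""
    last_writer = {}   # flag name -> index of the instruction that last wrote it
    sites = []
    for index, mnemonic in enumerate(program):
        # None stands for "written somewhere before this snippet".
        sources = {last_writer.get(flag) for flag in READS.get(mnemonic, ())}
        if len(sources) > 1:
            sites.append(index)
        for flag in WRITES.get(mnemonic, ()):
            last_writer[flag] = index
    return sites

print(merge_uop_sites(["add", "inc", "ja"]))   # ja reads CF from add and ZF from inc
print(merge_uop_sites(["add", "inc", "jc"]))   # a single-flag reader is never flagged
```

Because a single-flag reader always has exactly one source, the sketch never reports it, which matches the rule as the answer states it.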

Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group of flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition be requires both the C flag and one or more of the SPAZO flags. Most conditional moves, however, use only one or the other of the C and SPAZO flags, and take one uop.
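To see which conditions span both rename groups, one can list the flags each x86 condition code tests and bucket them into the C group and the SPAZO group. The table below reflects the architectural definitions of the condition codes; the names CONDITION_FLAGS and flag_groups are hypothetical, and the two-group split is taken from the paragraph above rather than from any official description of the renamer.

```python
SPAZO = {"SF", "PF", "AF", "ZF", "OF"}

# Flags tested by each condition code (one spelling per condition; aliases omitted).
CONDITION_FLAGS = {
    "o": {"OF"}, "no": {"OF"},
    "b": {"CF"}, "ae": {"CF"},
    "e": {"ZF"}, "ne": {"ZF"},
    "be": {"CF", "ZF"}, "a": {"CF", "ZF"},
    "s": {"SF"}, "ns": {"SF"},
    "p": {"PF"}, "np": {"PF"},
    "l": {"SF", "OF"}, "ge": {"SF", "OF"},
    "le": {"ZF", "SF", "OF"}, "g": {"ZF", "SF", "OF"},
}

def flag_groups(condition):
    """Return the rename groups ("C", "SPAZO") that a condition code reads from."""
    flags = CONDITION_FLAGS[condition]
    groups = set()
    if "CF" in flags:
        groups.add("C")
    if flags & SPAZO:
        groups.add("SPAZO")
    return groups

# Conditions that need both groups as inputs.
both = sorted(cc for cc in CONDITION_FLAGS if len(flag_groups(cc)) == 2)
print(both)   # ['a', 'be']
```

Only a and be (and their aliases nbe and na) read from both groups, which is why cmovbe and cmova stand out among the conditional moves.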

Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as noted above, at most one of the two applies to any given architecture. So a statement like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".

The following example causes a stall on the older architectures and a merging uop on Sandy Bridge and Ivy Bridge, but neither on Skylake:

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
ja  label    ; reads CF and ZF

The ja instruction reads CF and ZF, which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads CF, which was not set by the most recent flag-setting instruction.

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jc  label    ; reads CF

This causes a stall because, as in the prior example, the instruction reads CF, which is not set by the last flag-setting instruction (here inc). In this case the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag-setting operation. No merge uop is needed because the flags read (only CF) all come from the same add instruction.

Note: This case is under debate (see the comments), but I cannot test it because I find no evidence of any merging uops at all on my Skylake.

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jnz  label   ; reads ZF

Here there is no stall and no merging uop is needed, even though the last instruction (inc) sets only some flags, because the consuming jnz reads only (a subset of) the flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.

Here's another example that doesn't cause any stall or merge uop:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
ja  label    ; reads CF and ZF

Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.

The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling (see footnote 1). For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures, variable shifts have a significant cost of 3 uops due to flag handling (but there is no longer a "stall").

I'm not going to include all the gory details here, but I'd recommend searching for the word "shift" in Agner's microarch doc if you want them.

Some rotate instructions also have interesting flag-related behavior, in some cases similar to shifts.

1. For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.
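As a rough illustration of the footnote, the sketch below encodes the architecturally documented flag effects of shl/shr/sar as a function of the shift count. The function name shift_flags_defined is hypothetical, and the sketch says nothing about how any particular microarchitecture implements or penalizes this behavior. It only lists flags that end up with a defined value: AF is undefined for any non-zero count, and OF is undefined for counts greater than 1.

```python
def shift_flags_defined(count, operand_bits=64):
    """Return the flags that shl/shr/sar leave with a defined value for this count.

    The count is masked first, as the hardware does: to 6 bits for 64-bit
    operands and to 5 bits for all smaller operand sizes.
    """
    mask = 63 if operand_bits == 64 else 31
    effective = count & mask
    if effective == 0:
        return frozenset()            # a zero count leaves every flag unchanged
    flags = {"CF", "SF", "ZF", "PF"}  # written for any non-zero count
    if effective == 1:
        flags.add("OF")               # OF is only defined for 1-bit shifts
    return frozenset(flags)

for count in (0, 1, 5):
    print(count, sorted(shift_flags_defined(count)))
```

The zero-count case is the awkward one for hardware: with a variable count in cl, whether the flags are written at all is not known until the count is available, so the shift has to carry the previous flag values as an input.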
