What is a Partial Flag Stall?

Question

I was just reading through an answer by Peter Cordes, where he says:

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?

Recommended answer

Generally speaking a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.

So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but will cause a stall if a subsequent instruction reads the flag (CF) that was not set by inc (without any intervening instruction that sets the CF flag). This also implies that instructions that write all interesting flags are never involved in partial stalls since when they are the most recent flag setting instruction at the point a flag reading instruction is executed, they must have written the consumed flag.

So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family and cmovcc and a few specialized instructions like adc) and then walk backwards to find the first instruction that sets any flag and check if it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
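
The backward walk described above can be sketched in Python. This is only an illustration of the rule as stated in this answer, not a model of any real CPU: the `WRITES` and `READS` tables cover a handful of instructions, and the function name `predicted_partial_flag_stalls` and the list-of-mnemonics program format are assumptions made for the example.

```python
# Architectural flag behaviour for a small, illustrative subset of instructions.
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags each instruction writes; unlisted instructions are treated as writing none.
WRITES = {
    "add": ALL_FLAGS,
    "sub": ALL_FLAGS,
    "cmp": ALL_FLAGS,
    "adc": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc leaves CF untouched
    "dec": ALL_FLAGS - {"CF"},  # dec leaves CF untouched
}

# Flags each instruction reads; unlisted instructions are treated as reading none.
READS = {
    "ja": frozenset({"CF", "ZF"}),
    "jc": frozenset({"CF"}),
    "jnz": frozenset({"ZF"}),
    "adc": frozenset({"CF"}),
}


def predicted_partial_flag_stalls(program):
    """Return indices of flag readers that the rule above predicts will stall.

    `program` is a list of mnemonics in program order (hypothetical format).
    """
    stalls = []
    for i, op in enumerate(program):
        needed = READS.get(op, frozenset())
        if not needed:
            continue
        # Walk backwards to the most recent instruction that writes any flag.
        for prev in reversed(program[:i]):
            written = WRITES.get(prev, frozenset())
            if written:
                # Stall if that writer did not produce every flag being read.
                if not needed <= written:
                    stalls.append(i)
                break
        # If no earlier flag writer exists in the sequence, nothing is reported.
    return stalls


print(predicted_partial_flag_stalls(["add", "inc", "ja"]))   # reader at index 2
print(predicted_partial_flag_stalls(["add", "inc", "jnz"]))  # no stall predicted
```

The sketch treats the program as straight-line code, so it says nothing about flags that arrive across a branch or a function call.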

Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but in some cases still pay a penalty in the form of an additional uop added to the front-end by the instruction. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag merging uop is added only when a flag consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
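
As a rough sketch, the merge-uop rule stated above can be written as a check on who last wrote each flag that a consumer reads. The tables, the function name `needs_merge_uop`, and the list-of-mnemonics format are assumptions for illustration. The code encodes the rule as this answer words it and has not been checked against hardware.

```python
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by a few instructions (illustrative subset).
WRITES = {
    "add": ALL_FLAGS,
    "sub": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc leaves CF untouched
    "dec": ALL_FLAGS - {"CF"},  # dec leaves CF untouched
}

# Flags read by a few consumers (illustrative subset).
READS = {
    "ja": frozenset({"CF", "ZF"}),
    "jc": frozenset({"CF"}),
    "jnz": frozenset({"ZF"}),
}


def needs_merge_uop(program, index):
    """True if the flags read at `index` were last written by different instructions."""
    needed = READS.get(program[index], frozenset())
    last_writers = set()
    for flag in needed:
        # Find the most recent earlier instruction that wrote this flag.
        for j in range(index - 1, -1, -1):
            if flag in WRITES.get(program[j], frozenset()):
                last_writers.add(j)
                break
        # A flag with no writer in the sequence is ignored (its origin is unknown).
    return len(last_writers) > 1


print(needs_merge_uop(["add", "inc", "ja"], 2))  # CF from add, ZF from inc
print(needs_merge_uop(["add", "inc", "jc"], 2))  # only CF is read
```

A reader of a single flag can only ever have one last writer, which is why the function never reports a merge for `jc` or `jnz`.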

Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe which has two register inputs, and whose condition be requires the use of both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of C and SPAZO flags, however, and take one uop.

Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as above only at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".

The following example will cause a stall and merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
ja  label    ; reads CF and ZF

The ja instruction reads CF and ZF which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF which was not set by the most recent flag setting instruction.

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jc  label    ; reads CF

This causes a stall because, as in the prior example, CF is read but was not set by the last flag setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag setting operation. No merge uop is needed because the flags read (only CF) all come from the same add instruction.

Note: This case is under debate (see the comments) - but I cannot test it because I don't find evidence of any merging ops at all on my Skylake.

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jnz  label   ; reads ZF

Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.

Here's another example that doesn't cause any stall or merge uop:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
ja  label    ; reads CF and ZF

Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.

The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling (see footnote 1). For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").

I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.

Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.

1 For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.
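
To make the footnote concrete, here is a small sketch of which flags end up with architecturally defined values after shl, shr or sar, as documented in Intel's instruction reference. The function name `shift_defined_flags` is an assumption for illustration, and the sketch is limited to 32-bit and 64-bit operands, where the masked count is always smaller than the operand width.

```python
def shift_defined_flags(count, operand_bits=64):
    """Flags with defined values after shl/shr/sar by `count` (32/64-bit operands only)."""
    if operand_bits not in (32, 64):
        raise ValueError("this sketch only covers 32-bit and 64-bit operands")
    # The hardware masks the count to 5 bits (6 bits for 64-bit operands).
    masked = count & (63 if operand_bits == 64 else 31)
    if masked == 0:
        # A zero count leaves every flag unchanged.
        return frozenset()
    # SF, ZF and PF reflect the result; CF holds the last bit shifted out.
    defined = {"CF", "PF", "ZF", "SF"}
    if masked == 1:
        # OF is defined only for 1-bit shifts; AF is undefined for any non-zero count.
        defined.add("OF")
    return frozenset(defined)


for n in (0, 1, 3):
    print(n, sorted(shift_defined_flags(n)))
```

Because the outcome depends on a count that may only be known at run time, a variable-count shift cannot be treated as a plain "writes all flags" instruction, which is consistent with the extra cost mentioned above.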
