INC instruction vs ADD 1: Does it matter?

Question

Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.

author: @Ira Baxter

The snippet above is from "Why the INC and DEC instructions do not affect the carry flag?" (http://stackoverflow.com/questions/13435142/why-the-inc-and-dec-instructions-do-not-affect-the-carry-flag)

Why can inc cause stalls in the pipeline while add doesn't? After all, both add and inc update the flags register. The only difference is that inc doesn't update CF. Why does that matter?

Recommended answer

This is stale optimization advice left over from Pentium 4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags. e.g. in BigInteger adc loops.

"Modern" CPUs include Intel's P6 microarchitecture family (PPro / Core2 / Nehalem) and Sandybridge-family, as well as AMD K8 and later (and probably K7, too, IDK), and any other CPU that renames different parts of EFLAGS separately.

Some modern compilers (including clang-3.8 and Intel's ICC 13) do use inc when optimizing for speed (-O3), not just for size, when they don't need to check the carry flag afterwards. Compared with add r32, 1, this saves 1 byte (64bit mode), or 2 bytes (inc r32 short form in 32bit mode), which makes a small percentage difference in total code size.

clang-3.8: http://gcc.godbolt.org/#compilers:!((compiler:clang380,options:'-xc+-Wall+-Wextra+-std%3Dgnu11+-fverbose-asm+-O3+-mtune%3Dintel',source:'int+global_a%3B%0Avoid+foo()+%7B+%2B%2Bglobal_a%3B+%7D%0A%0Aint+inc_in_regs(int+a,+int+b)+%7B%0A++%2B%2Ba%3B+%2B%2Bb%3B%0A++return+a+*+b%3B%0A%7D')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3

Intel's ICC 13: http://gcc.godbolt.org/#compilers:!((compiler:icc1301,options:'-xc+-Wall+-Wextra+-std%3Dgnu11+-fverbose-asm+-O3+-mtune%3Dintel',source:'int+global_a%3B%0Avoid+foo()+%7B+%2B%2Bglobal_a%3B+%7D%0A%0Aint+inc_in_regs(int+a,+int+b)+%7B%0A++%2B%2Ba%3B+%2B%2Bb%3B%0A++return+a+*+b%3B%0A%7D')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3
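The C source encoded in the Compiler Explorer URLs for this paragraph decodes to roughly the following. The comments are added here, and `foo(void)` is written with an explicit `void` parameter list; whether a particular compiler actually emits inc or add for it depends on the compiler version and tuning options, so treat this as a test case to inspect rather than a guaranteed result.

```c
/* Test case decoded from the Compiler Explorer links (comments added). */
int global_a;

/* Increment of a value in memory; the flag results are never read. */
void foo(void) { ++global_a; }

/* Two increments in registers, followed by a multiply that overwrites flags. */
int inc_in_regs(int a, int b) {
    ++a; ++b;
    return a * b;
}
```

Compiling with something like `clang -O3 -mtune=intel -S` and reading the generated assembly should show which instruction the compiler picked for each increment.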

On P4, inc and dec depend on the previous value of the flags, so they can't execute in parallel with each other or preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency.

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because its flags never need to be merged. It has false dependencies instead.

Several answers / comments mix up the terminology and describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

### Partial flag stall on Intel P6-family CPUs:
bigint_loop:
    adc   eax, [array_end + rcx*4]   # partial-flag stall when adc reads CF 
    inc   rcx                        # rcx counts up from negative values towards zero
    # test rcx,rcx  # uncomment to eliminate partial-flag stalls by writing all flags
    jnz   bigint_loop                # jnz reads only ZF, which inc just wrote
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1


Using inc before a variable-count shift instruction could possibly create a partial-flag stall, since x86's stupid CISC semantics say that shl reg, cl doesn't update flags if the shift count turns out to be zero. So the insn is 3 uops, with the old flags as one of the inputs. I'll probably check on this and update this answer.
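The zero-count rule mentioned above can be sketched as a small C model of the architectural behaviour. `Flags` and `model_shl_cl` are hypothetical names invented for this illustration, only CF and ZF are modeled, and the model says nothing about uop counts or timing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-flag model; real EFLAGS has more status flags. */
typedef struct {
    bool cf;  /* carry flag */
    bool zf;  /* zero flag  */
} Flags;

/* Model of `shl r32, cl`: flags are written only when the masked count is nonzero. */
static uint32_t model_shl_cl(uint32_t value, uint8_t cl, Flags *f) {
    unsigned count = cl & 31u;               /* 32-bit shifts mask the count to 5 bits */
    if (count == 0u) {
        return value;                        /* flags untouched: old values pass through */
    }
    f->cf = (value >> (32u - count)) & 1u;   /* last bit shifted out of the register */
    uint32_t result = value << count;
    f->zf = (result == 0u);
    return result;
}
```

Because the flag outputs depend on the incoming flags whenever the count is zero, the old flags are effectively an input to the instruction, which is why a preceding partial flag write could matter here.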

Other cases are fine, e.g. a partial flag write followed by a full flag write, or a read of only flags written by inc. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub (see http://stackoverflow.com/questions/31771526/x86-64-assembly-loop-conditions-and-out-of-order/31778403#31778403).
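The rule used throughout this answer (a read is only a problem if the flag it reads was not written by the last instruction that wrote any flags) can be written down as a toy tracker in C. This is a deliberately simplified sketch: `FlagTracker` and its functions are hypothetical names, only CF and ZF are tracked, and real hardware splits EFLAGS into groups whose details vary by microarchitecture.

```c
#include <stdbool.h>

/* Hypothetical model: two flags, each remembering which instruction wrote it. */
enum { FLAG_CF = 0, FLAG_ZF = 1, FLAG_COUNT = 2 };

typedef struct {
    int writer[FLAG_COUNT];  /* id of the instruction that last wrote each flag */
    int last_flag_writer;    /* id of the last instruction that wrote any flag  */
} FlagTracker;

/* Full flag write, as done by add, sub, adc, test, ... */
static void write_all_flags(FlagTracker *t, int insn_id) {
    t->writer[FLAG_CF] = insn_id;
    t->writer[FLAG_ZF] = insn_id;
    t->last_flag_writer = insn_id;
}

/* Partial flag write, as done by inc and dec: CF keeps its old writer. */
static void write_flags_except_cf(FlagTracker *t, int insn_id) {
    t->writer[FLAG_ZF] = insn_id;
    t->last_flag_writer = insn_id;
}

/* True when the flag being read did not come from the last flag-writing instruction. */
static bool read_needs_merge(const FlagTracker *t, int flag) {
    return t->writer[flag] != t->last_flag_writer;
}
```

In this model, adc followed by inc makes a later CF read (the next adc) need a merge, while a jnz that reads only ZF does not; a full write such as test in between clears the condition, at the cost of changing CF.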

After P4, CPU designers decided to give up on making people change code, and instead made hardware support it (by renaming parts of EFLAGS separately). This is similar to how Intel P6/SnB-families rename partial registers: inc ah doesn't have a false dependency on the old value of eax, but reading eax after that will cause a partial register stall (PPro/PIII stall for 5-6 cycles. Core2 stalls for only 2 or 3 cycles and inserts a merging uop for partial regs, but not partial flags. SnB-family inserts a merging uop without stalling, like for flags. Haswell merges with no extra uops or stalls. Agner Fog speculates that Haswell may do dual-bookkeeping of some sort.) See Agner Fog's microarch guide. AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)

Normally, the only time add instead of inc will make your code faster is when your code actually depends on the doesn't-touch-CF behaviour of inc, i.e. the only time it would help is when it would break your code. Pre-SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc when targeting those CPUs. (For details, see the BigInteger link in the first paragraph.)
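To make the doesn't-touch-CF difference concrete, here is a minimal C model of the architectural flag results of the two instructions. `Flags`, `model_add1` and `model_inc` are hypothetical names for this sketch, and only CF and ZF are modeled (both instructions also write SF, OF, AF and PF).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-flag model; not a real API. */
typedef struct {
    bool cf;  /* carry flag */
    bool zf;  /* zero flag  */
} Flags;

/* add r32, 1: writes every status flag, so the old CF is lost. */
static uint32_t model_add1(uint32_t x, Flags *f) {
    uint32_t r = x + 1u;
    f->cf = (r == 0u);   /* wrapping to zero means a carry out of bit 31 */
    f->zf = (r == 0u);
    return r;
}

/* inc r32: same result and same ZF, but CF is left exactly as it was. */
static uint32_t model_inc(uint32_t x, Flags *f) {
    uint32_t r = x + 1u;
    f->zf = (r == 0u);
    return r;
}
```

A loop that carries CF from one adc to the next across a counter update works with the inc model but not with the add model, which is the sense in which swapping in add would break such code.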
