Is < faster than <=?

Question

Is if (a < 901) faster than if (a <= 900)?

Not exactly as in this simple example, but I see slight performance changes in complex loop code. I suppose this has something to do with the generated machine code, if it is even true.

Solution

No, it will not be faster on most architectures. You didn't specify an architecture, but on x86 every integer comparison is typically implemented as two machine instructions:

  • A test or cmp instruction, which sets EFLAGS
  • And a Jcc (jump) instruction, depending on the comparison type (and code layout):
      • jne - Jump if not equal --> ZF = 0
      • jz - Jump if zero (equal) --> ZF = 1
      • jg - Jump if greater --> ZF = 0 and SF = OF
      • (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c:

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.
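
To reproduce the comparison yourself, a source file along the following lines could be fed to the gcc command above. This is only a sketch: the function names less_than and less_or_equal are made up for illustration, and the exact assembly (registers, stack offsets, label names) will vary with compiler version and optimization level.

```c
/* test.c: hypothetical file for inspecting compiler output.
   Build with: gcc -m32 -S -masm=intel test.c
   The function names are illustrative, not from the original answer. */

int less_than(int a, int b)
{
    if (a < b) {
        return 1;   /* stands in for "Do something 1" */
    }
    return 0;
}

int less_or_equal(int a, int b)
{
    if (a <= b) {
        return 1;   /* stands in for "Do something 2" */
    }
    return 0;
}
```

Diffing the assembly generated for the two functions is a quick way to check whether anything besides the conditional instruction differs on your own toolchain.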


I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The Optimization Reference Manual uses the same grouping in Appendix C, "Latency and Throughput", which defines the two terms as follows:

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency

The values for Jcc are:

      Latency   Throughput
Jcc     N/A        0.5

with the following footnote on Jcc:

  1. Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, "Branch Prediction Optimization," to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there are simple AND/OR gates on the different bits in EFLAGS to determine whether the conditions are met. There is then no reason that an instruction testing two bits should take any more or less time than one testing only one bit (ignoring gate propagation delay, which is much less than the clock period).
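
The flag logic can be modelled in plain C to make the point concrete. The sketch below is a software model, not a description of how any particular CPU is wired: cmp_flags imitates the ZF, SF and OF results of a 32-bit cmp a, b, and the cond_* functions are the boolean conditions the Intel Jcc table gives for jg, jge, jl and jle. All names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags of interest after `cmp a, b`, which computes a - b and
   keeps only the flags. This is a model, not real hardware. */
struct flags {
    bool zf;  /* zero flag: the result was zero */
    bool sf;  /* sign flag: top bit of the result */
    bool of;  /* overflow flag: signed overflow occurred */
};

struct flags cmp_flags(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t r = ua - ub;   /* unsigned arithmetic wraps like the ALU */
    struct flags f;

    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    /* Signed overflow on subtraction: the operands differ in sign
       and the result's sign differs from the sign of a. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 31) != 0;
    return f;
}

/* Each condition is one small boolean expression over the flags. */
bool cond_jg(struct flags f)  { return !f.zf && f.sf == f.of; }
bool cond_jge(struct flags f) { return f.sf == f.of; }
bool cond_jl(struct flags f)  { return f.sf != f.of; }
bool cond_jle(struct flags f) { return f.zf || f.sf != f.of; }
```

For any pair of 32-bit integers, cond_jl(cmp_flags(a, b)) should agree with a < b and cond_jle(cmp_flags(a, b)) with a <= b. The two predicates differ by a single extra term, which is the kind of difference a couple of gates absorb well within one clock period.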


Edit: Floating Point

This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
        fstp    st(0)
        seta    al                     ; Set al if above (CF=0 and ZF=0).
        test    al, al
        je      .L2
        ; Do something 1
.L2:

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; (same thing as above)
        fstp    st(0)
        setae   al                     ; Set al if above or equal (CF=0).
        test    al, al
        je      .L5
        ; Do something 2
.L5:
        leave
        ret
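
The same kind of model works for the floating-point case. The sketch below imitates how fucomip sets CF, PF and ZF, and then applies the seta and setae conditions from the listing. It assumes, by analogy with the integer example, that [esp+32] holds a and [esp+40] holds b, so that b ends up in ST(0) and a < b is evaluated as "b is above a". All function and struct names are hypothetical.

```c
#include <stdbool.h>

/* Flags written by fucomi/fucomip in this software model. */
struct fpflags {
    bool cf;
    bool pf;
    bool zf;
};

/* Compare ST(0) with ST(1) the way fucomip reports it:
   unordered (a NaN operand) sets all three flags, ST(0) < ST(1)
   sets CF, equality sets ZF, and ST(0) > ST(1) clears all. */
struct fpflags fucomi_flags(double st0, double st1)
{
    struct fpflags f = { false, false, false };

    if (st0 != st0 || st1 != st1) {      /* x != x only holds for NaN */
        f.cf = f.pf = f.zf = true;
    } else if (st0 < st1) {
        f.cf = true;
    } else if (st0 == st1) {
        f.zf = true;
    }
    return f;
}

/* seta: set if above (CF=0 and ZF=0). setae: set if above or equal (CF=0). */
bool cond_seta(struct fpflags f)  { return !f.cf && !f.zf; }
bool cond_setae(struct fpflags f) { return !f.cf; }

/* Assumption: b was loaded last, so it sits in ST(0) and a in ST(1). */
bool fp_less_than(double a, double b)
{
    return cond_seta(fucomi_flags(b, a));
}

bool fp_less_or_equal(double a, double b)
{
    return cond_setae(fucomi_flags(b, a));
}
```

As in the integer case, the two comparisons differ only in which flag bits the final condition looks at. Because an unordered result sets CF, both conditions come out false when either operand is NaN, matching the C semantics of < and <=.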
