Boolean values as 8 bit in compilers. Are operations on them inefficient?


Question

I'm reading Agner Fog's "Optimizing software in C++" (specific to x86 processors for Intel, AMD and VIA) and it states on page 34

Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true. Boolean variables are overdetermined in the sense that all operators that have Boolean variables as input check if the inputs have any other value than 0 or 1, but operators that have Booleans as output can produce no other value than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary.

Is this still true today and on what compilers? Can you please give an example? The author states

The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.

Does this mean that if I take a function pointer bool(*)() for example and call it, then operations on it produce inefficient code? Or is it the case when I access a boolean by dereferencing a pointer or reading from a reference and then operate on it?

Answer

TL:DR: current compilers still have bool missed-optimizations when doing stuff like (a&&b) ? x : y. But the reason is not that they don't assume 0/1; they just suck at this.

Many uses of bool are for locals or inline functions, so the booleanization to 0 / 1 can optimize away, letting the compiler branch (or cmov or whatever) on the original condition. Only worry about optimizing bool inputs / outputs when a value has to be passed/returned across something that doesn't inline, or is really stored in memory.

Possible optimization guideline: combine bools from external sources (function args / memory) with bitwise operators, like a&b. MSVC and ICC do better with this. IDK if it's ever worse for local bools. Beware that a&b is only equivalent to a&&b for bool, not integer types. 2 && 1 is true, but 2 & 1 is 0 which is false. Bitwise OR doesn't have this problem.

IDK if this guideline will ever hurt for locals that were set from a comparison within the function (or in something that inlined). E.g. it might lead the compiler to actually make integer booleans instead of just using comparison results directly when possible. Also note that it doesn't seem to help with current gcc and clang.
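The caveat about & versus && can be checked directly. Here is a minimal sketch; the function names are made up for illustration and are not from Agner Fog's manual or any compiler:

```cpp
#include <cassert>

// Hypothetical helper names, for illustration only.
// For bool operands, & and && give the same value (only && short-circuits).
constexpr bool and_logical(bool a, bool b) { return a && b; }
constexpr bool and_bitwise(bool a, bool b) { return a & b; }

// For general integers they differ: 2 && 1 is true, but 2 & 1 is 0.
constexpr bool int_and_logical(int a, int b) { return a && b; }
constexpr bool int_and_bitwise(int a, int b) { return (a & b) != 0; }

// Bitwise OR has no such problem: (a | b) != 0 exactly when a || b.
constexpr bool int_or_logical(int a, int b) { return a || b; }
constexpr bool int_or_bitwise(int a, int b) { return (a | b) != 0; }

static_assert(int_and_logical(2, 1) && !int_and_bitwise(2, 1),
              "& and && disagree for the integers 2 and 1");

// Exhaustive check over the four bool input combinations.
void check_bool_and_equivalence() {
    const bool values[] = {false, true};
    for (bool a : values)
        for (bool b : values)
            assert(and_logical(a, b) == and_bitwise(a, b));
}
```

Calling check_bool_and_equivalence() should pass silently. Whether the bitwise form actually produces better machine code depends on the compiler, so it is still worth looking at the generated asm.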


Yes, C++ implementations on x86 store bool in a byte that's always 0 or 1 (at least across function-call boundaries where the compiler has to respect the ABI / calling convention which requires this.)

Compilers do sometimes take advantage of this, e.g. for bool->int conversion even gcc 4.4 simply zero-extends to 32-bit (movzx eax, dil). Clang and MSVC do this, too. C and C++ rules require this conversion to produce 0 or 1, so this behaviour is only safe if it's always safe to assume that a bool function arg or global variable has a 0 or 1 value.
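As a rough illustration of the conversion being discussed (the helper names here are hypothetical), source like this is what lets a compiler get away with a plain zero-extension, because the language guarantees the result is exactly 0 or 1:

```cpp
#include <cassert>

// Hypothetical helper names, for illustration only.
// bool -> int conversion: the language requires the result to be 0 or 1.
int bool_to_int(bool b) { return b; }

// Counting flags relies on the same guarantee: each bool promotes to 0 or 1.
int count_true(bool a, bool b, bool c) { return a + b + c; }

void check_bool_to_int() {
    assert(bool_to_int(false) == 0);
    assert(bool_to_int(true) == 1);
    assert(count_true(true, false, true) == 2);
}
```

If the compiler could not assume the incoming byte was already 0 or 1, it would need a test/setcc sequence here instead of a single movzx.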

Even old compilers typically did take advantage of it for bool->int, but not in other cases. Thus, Agner is wrong about the reason when he says:

The reason why the compiler doesn't make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources.


MSVC CL19 does make code that assumes bool function args are 0 or 1, so the Windows x86-64 ABI must guarantee this.

In the x86-64 System V ABI (used by everything other than Windows), the changelog for revision 0.98 says "Specify that _Bool (aka bool) is booleanized at the caller." I think even before that change, compilers were assuming it, but this just documents what compilers were already relying on. The current language in the x86-64 SysV ABI is:

3.1.2 Data Representation

Booleans, when stored in a memory object, are stored as single byte objects the value of which is always 0 (false) or 1 (true). When stored in integer registers (except for passing as arguments), all 8 bytes of the register are significant; any nonzero value is considered true.

The second sentence is nonsense: the ABI has no business telling compilers how to store things in registers inside a function, only at boundaries between different compilation units (memory / function args and return values). I reported this ABI defect a while ago on the github page where it's maintained.

3.2.3 Parameter passing:

When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero [16].

(footnote 16): Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit.

The language in the i386 System V ABI is the same, IIRC.
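The in-memory representation the ABI describes can be observed from C++ by copying out the byte. This is only a sketch: the C++ standard itself doesn't specify the object representation of bool, so the 0/1 result is an assumption that holds for the x86 ABIs quoted above, and bool_byte is a made-up helper name.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: returns the raw byte stored for a bool object.
unsigned char bool_byte(bool b) {
    static_assert(sizeof(bool) == 1,
                  "this sketch assumes a 1-byte bool, as on the x86 ABIs");
    unsigned char raw;
    std::memcpy(&raw, &b, sizeof raw);  // read the object representation
    return raw;
}

void check_bool_bytes() {
    assert(bool_byte(false) == 0);
    assert(bool_byte(true) == 1);
    // Converting any non-zero integer to bool booleanizes it to 1.
    assert(bool_byte(static_cast<bool>(42)) == 1);
}
```

Going the other way (writing a byte other than 0 or 1 into a bool object and then reading it as a bool) is outside what the ABI guarantees, so it isn't shown here.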


Any compiler that assumes 0/1 for one thing (e.g. conversion to int) but fails to take advantage of it in other cases has a missed optimization. Unfortunately such missed-optimizations still exist, although they are rarer than when Agner wrote that paragraph about compilers always re-booleanizing.

(Source + asm on the Godbolt compiler explorer for gcc4.6 / 4.7, and clang/MSVC. See also Matt Godbolt's CppCon2017 talk What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid)

bool logical_or(bool a, bool b) { return a||b; }

 # gcc4.6.4 -O3 for the x86-64 System V ABI
    test    dil, dil            # test a against itself (for non-zero)
    mov     eax, 1
    cmove   eax, esi            # return   a ? 1 : b;
    ret

So even gcc4.6 didn't re-booleanize b, but it did miss the optimization that gcc4.7 makes (as do clang and later compilers, as shown in other answers):

    # gcc4.7 -O3 to present: looks ideal to me.
    mov     eax, esi
    or      eax, edi
    ret

(Clang's or dil, sil / mov eax, edi is silly: it's guaranteed to cause a partial-register stall on Nehalem or earlier Intel when reading edi after writing dil, and it has worse code size from needing a REX prefix to use the low-8 part of edi. A better choice might be or dil,sil / movzx eax, dil if you want to avoid reading any 32-bit registers in case your caller left some arg-passing registers with "dirty" partial registers.)

MSVC emits this code that checks a then b separately, completely failing to take advantage of anything, and even using xor al,al instead of xor eax,eax. So it has a false dependency on the old value of eax on most CPUs (including Haswell/Skylake, which don't rename low-8 partial regs separately from the whole register, only AH/BH/...). This is just dumb. The only reason to ever use xor al,al is when you explicitly want to preserve the upper bytes.

logical_or PROC                     ; x86-64 MSVC CL19
    test     cl, cl                 ; Windows ABI passes args in ecx, edx
    jne      SHORT $LN3@logical_or
    test     dl, dl
    jne      SHORT $LN3@logical_or
    xor      al, al                 ; missed peephole: xor eax,eax is strictly better
    ret      0
$LN3@logical_or:
    mov      al, 1
    ret      0
logical_or ENDP

ICC18 also doesn't take advantage of the known 0/1 nature of the inputs; it just uses an or instruction to set flags according to the bitwise OR of the two inputs, and setcc to produce a 0/1.

logical_or(bool, bool):             # ICC18
    xor       eax, eax                                      #4.42
    movzx     edi, dil                                      #4.33
    movzx     esi, sil                                      #4.33
    or        edi, esi                                      #4.42
    setne     al                                            #4.42
    ret                                                     #4.42

ICC emits the same code even for bool bitwise_or(bool a, bool b) { return a|b; }. It promotes to int (with movzx), and uses or to set flags according to the bitwise OR. This is dumb compared to or dil,sil / setne al.

For bitwise_or, MSVC does just use an or instruction (after movzx on each input), but anyway doesn't re-booleanize.
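For reference, the two source versions compared above compute the same function for every valid bool input, which is why a single or instruction is a legal implementation of both. A small sketch that checks this exhaustively (check_or_versions is a hypothetical helper, not part of the original test code):

```cpp
#include <cassert>

bool logical_or(bool a, bool b) { return a || b; }
bool bitwise_or(bool a, bool b) { return a | b; }

// Hypothetical helper: compares both versions over all four input pairs.
void check_or_versions() {
    const bool values[] = {false, true};
    for (bool a : values)
        for (bool b : values)
            assert(logical_or(a, b) == bitwise_or(a, b));
}
```

The only source-level difference is that || short-circuits, which doesn't matter when both operands are plain values with no side effects.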


Missed optimizations in current gcc/clang:

Only ICC/MSVC were making dumb code with the simple function above, but this function still gives gcc and clang trouble:

int select(bool a, bool b, int x, int y) {
    return (a&&b) ? x : y;
}

Source+asm on the Godbolt compiler explorer (Same source, different compilers selected vs. last time).

Looks simple enough; you'd hope that a smart compiler would do it branchlessly with one test/cmov. x86's test instruction sets flags according to a bitwise AND. It's an AND instruction that doesn't actually write the destination. (Just like cmp is a sub that doesn't write the destination).

# hand-written implementation that no compilers come close to making
select:
    mov     eax, edx      # retval = x
    test    edi, esi      # ZF =  ((a & b) == 0)
    cmovz   eax, ecx      # conditional move: return y if ZF is set
    ret

But even the daily builds of gcc and clang on the Godbolt compiler explorer make much more complicated code, checking each boolean separately. They know how to optimize bool ab = a&&b; if you return ab, but even writing it that way (with a separate boolean variable to hold the result) doesn't manage to hand-hold them into making code that doesn't suck.

Note that test same,same is exactly equivalent to cmp reg, 0, and is smaller, so it's what compilers use.
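To make the flag logic concrete, here is a C++ model of the hand-written sequence above. It is a sketch only: it models just ZF (the real test and cmp instructions set other flags too), it assumes the caller zero-extended the bool args, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Models `test r1, r2`: ZF is set when the bitwise AND is zero.
bool zf_after_test(std::uint32_t r1, std::uint32_t r2) { return (r1 & r2) == 0; }

// Models `cmp r, 0`: ZF is set when r - 0 is zero.
bool zf_after_cmp_zero(std::uint32_t r) { return r - 0u == 0u; }

// Models: mov eax, edx / test edi, esi / cmovz eax, ecx / ret
int select_model(bool a, bool b, int x, int y) {
    std::uint32_t edi = a, esi = b;  // assumption: args arrive zero-extended
    int eax = x;                     // mov   eax, edx   (retval = x)
    if (zf_after_test(edi, esi))     // test  edi, esi   (ZF = ((a & b) == 0))
        eax = y;                     // cmovz eax, ecx   (take y if ZF is set)
    return eax;
}

void check_select_model() {
    const bool values[] = {false, true};
    for (bool a : values)
        for (bool b : values)
            assert(select_model(a, b, 10, 20) == ((a && b) ? 10 : 20));
    // test same,same sets ZF exactly when cmp reg, 0 would.
    assert(zf_after_test(0, 0) == zf_after_cmp_zero(0));
    assert(zf_after_test(7, 7) == zf_after_cmp_zero(7));
}
```

A single AND of the two inputs is enough only because each input is known to be 0 or 1; with arbitrary non-zero "true" values, 2 and 1 would AND to 0.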

Clang's version is strictly worse than my hand-written version. (Note that it requires that the caller zero-extended the bool args to 32-bit, like it does for narrow integer types as an unofficial part of the ABI which it and gcc implement but only clang depends on).

select:  # clang 6.0 trunk 317877 nightly build on Godbolt
    test    esi, esi
    cmove   edx, ecx         # x = b ? x : y   (cmove copies y when b == 0)
    test    edi, edi
    cmove   edx, ecx         # x = a ? x : y
    mov     eax, edx         # return x
    ret

gcc 8.0.0 20171110 nightly makes branchy code for this, similar to what older gcc versions do.

select(bool, bool, int, int):   # gcc 8.0.0-pre   20171110
    test    dil, dil
    mov     eax, edx          # compiling with -mtune=intel or -mtune=haswell would keep test/jcc together for macro-fusion.
    je      .L8
    test    sil, sil
    je      .L8
    rep ret
.L8:
    mov     eax, ecx
    ret

MSVC x86-64 CL19 makes very similar branchy code. It's targeting the Windows calling convention, where integer args are in rcx, rdx, r8, r9.

select PROC
        test     cl, cl         ; a
        je       SHORT $LN3@select
        mov      eax, r8d       ; retval = x
        test     dl, dl         ; b
        jne      SHORT $LN4@select
$LN3@select:
        mov      eax, r9d       ; retval = y
$LN4@select:
        ret      0              ; 0 means rsp += 0 after popping the return address, not C return 0.
                                ; MSVC doesn't emit the `ret imm16` opcode here, so IDK why they put an explicit 0 as an operand.
select ENDP

ICC18 also makes branchy code, but with both mov instructions after the branches.

select(bool, bool, int, int):
        test      dil, dil                                      #8.13
        je        ..B4.4        # Prob 50%                      #8.13
        test      sil, sil                                      #8.16
        jne       ..B4.5        # Prob 50%                      #8.16
..B4.4:                         # Preds ..B4.2 ..B4.1
        mov       edx, ecx                                      #8.13
..B4.5:                         # Preds ..B4.2 ..B4.4
        mov       eax, edx                                      #8.13
        ret                                                     #8.13


Trying to help the compiler by using

int select2(bool a, bool b, int x, int y) {
    bool ab = a&&b;
    return (ab) ? x : y;
}

leads MSVC into making hilariously bad code:

;; MSVC CL19  -Ox  = full optimization
select2 PROC
    test     cl, cl
    je       SHORT $LN3@select2
    test     dl, dl
    je       SHORT $LN3@select2
    mov      al, 1              ; ab = 1

    test     al, al             ;; and then test/cmov on an immediate constant!!!
    cmovne   r9d, r8d
    mov      eax, r9d
    ret      0
$LN3@select2:
    xor      al, al            ;; ab = 0

    test     al, al            ;; and then test/cmov on another path with known-constant condition.
    cmovne   r9d, r8d
    mov      eax, r9d
    ret      0
select2 ENDP

This is only with MSVC (and ICC18 has the same missed optimization of test/cmov on a register that was just set to a constant).

gcc and clang as usual don't make code as bad as MSVC; they make the same asm they do for select(), which is still not good but at least trying to help them doesn't make it worse like with MSVC.


Combining bools with bitwise operators helps MSVC and ICC

In my very limited testing, | and & seem to work better than || and && for MSVC and ICC. Look at the compiler output for your own code with your compiler + compile options to see what happens.

int select_bitand(bool a, bool b, int x, int y) {
    return (a&b) ? x : y;
}

gcc still branches on separate tests of the two inputs, the same code as for the other versions of select. clang still does two separate test/cmov, the same asm as for the other source versions.

MSVC comes through and optimizes correctly, beating all the other compilers (at least in the stand-alone definition):

select_bitand PROC            ;; MSVC
    test     cl, dl           ;; ZF =  !(a & b)
    cmovne   r9d, r8d
    mov      eax, r9d         ;; could have done the mov to eax in parallel with the test, off the critical path, but close enough.
    ret      0

ICC18 wastes two movzx instructions zero-extending the bools to int, but then makes the same code as MSVC

select_bitand:          ## ICC18
    movzx     edi, dil                                      #16.49
    movzx     esi, sil                                      #16.49
    test      edi, esi                                      #17.15
    cmovne    ecx, edx                                      #17.15
    mov       eax, ecx                                      #17.15
    ret                                                     #17.15
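Since the three source variants in this answer are meant to be interchangeable, it's worth confirming that they really return the same value for every bool input before comparing their asm. A minimal sketch, wrapped in a hypothetical namespace demo so that select doesn't clash with the POSIX function of the same name:

```cpp
#include <cassert>

namespace demo {  // hypothetical namespace, for illustration only

int select(bool a, bool b, int x, int y) { return (a && b) ? x : y; }

int select2(bool a, bool b, int x, int y) {
    bool ab = a && b;
    return ab ? x : y;
}

int select_bitand(bool a, bool b, int x, int y) { return (a & b) ? x : y; }

// Compares all three variants over the four possible bool input pairs.
void check_select_versions() {
    const bool values[] = {false, true};
    for (bool a : values)
        for (bool b : values) {
            assert(select(a, b, 1, 2) == select2(a, b, 1, 2));
            assert(select(a, b, 1, 2) == select_bitand(a, b, 1, 2));
        }
}

}  // namespace demo
```

The variants differ only in how they're spelled, so any difference in the generated code is a matter of compiler optimization quality, not of semantics.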
