Why can't GCC generate an optimal operator== for a struct of two int32s?
Question
A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers to treat all of these attempts at an equality test as equivalent:
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}

[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}
But to my surprise, only the ones with memcpy or memcmp get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)
Isn't it obvious to the optimizer that checking equality on two contiguous pairs of four bytes is the same as comparing all eight bytes?
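The equivalence the question relies on can be checked directly. The sketch below (the names memberwiseEqual, wholeObjectEqual and agreeOnRandomInputs are illustrative, not from the question) compares the memberwise test against the single 64-bit compare on pseudo-random inputs. The two agree because std::int32_t has no padding bits and Point has no padding (the static_assert checks the latter), so two Points are equal member by member exactly when their eight bytes are equal.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <random>

struct Point {
    std::int32_t x, y;
};

// Memberwise comparison, as in naiveEqual.
bool memberwiseEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

// Whole-object comparison, as in optimizedEqual: copy the 8 bytes of each
// Point into a 64-bit integer and compare once.
bool wholeObjectEqual(const Point &a, const Point &b) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t), "Point must have no padding");
    std::uint64_t ai, bi;
    std::memcpy(&ai, &a, sizeof ai);
    std::memcpy(&bi, &b, sizeof bi);
    return ai == bi;
}

// Runs both comparisons on `trials` pseudo-random pairs and returns true if
// they never disagree. Values come from a tiny range so that partially equal
// pairs (same x but different y, or vice versa) occur often.
bool agreeOnRandomInputs(unsigned trials, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::int32_t> dist(-2, 2);
    for (unsigned i = 0; i < trials; ++i) {
        Point a{dist(gen), dist(gen)};
        Point b{dist(gen), dist(gen)};
        if (memberwiseEqual(a, b) != wholeObjectEqual(a, b)) {
            return false;
        }
    }
    return true;
}
```

For example, assert(agreeOnRandomInputs(10000, 42)); passes. A random test is only evidence; the no-padding argument is what makes the equivalence hold for every input.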
An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but still uses two separate 32-bit operations:
bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}
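Why the XOR/OR form is the same test as one 64-bit comparison can be shown without relying on the struct's memory layout. In this sketch, pack and packedEqual are hypothetical helpers: pack builds the 64-bit value arithmetically (x in the low half, y in the high half), so XORing two packed values yields x ^ x' in one half and y ^ y' in the other, and the result is zero exactly when bithackEqual returns true.

```cpp
#include <cassert>
#include <cstdint>

struct Point {
    std::int32_t x, y;
};

// The question's bithackEqual: zero only if both halves are equal.
bool bithackEqual(const Point &a, const Point &b) {
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}

// Hypothetical helper: pack a Point into 64 bits with x in the low half and
// y in the high half. This is arithmetic packing, so it does not depend on
// the struct's layout or on endianness.
std::uint64_t pack(const Point &p) {
    return static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.x)) |
           (static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.y)) << 32);
}

// The XOR of two packed values holds (x ^ x') in the low 32 bits and
// (y ^ y') in the high 32 bits, so it is zero exactly when both halves are
// zero: the same condition bithackEqual tests with an OR.
bool packedEqual(const Point &a, const Point &b) {
    return (pack(a) ^ pack(b)) == 0;
}
```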
GCC and Clang both have the same missed optimizations when passing the structs by value (so a is in RDI and b is in RSI, because that's how the x86-64 System V calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
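The code behind the Godbolt link is not reproduced here, so the following is only a guess at what the by-value versions look like (naiveEqualByValue and memcpyEqualByValue are hypothetical names). Because Point is 8 bytes of integer data, the x86-64 System V ABI passes each argument in a single general-purpose register, which is why one cmp rdi, rsi would be enough.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Hypothetical by-value counterpart of naiveEqual. On x86-64 System V each
// 8-byte Point arrives packed in one register (a in RDI, b in RSI).
bool naiveEqualByValue(Point a, Point b) {
    return a.x == b.x && a.y == b.y;
}

// Hypothetical by-value counterpart of optimizedEqual: compare all 8 bytes
// of each Point as one 64-bit integer.
bool memcpyEqualByValue(Point a, Point b) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t), "Point must have no padding");
    std::uint64_t ai, bi;
    std::memcpy(&ai, &a, sizeof ai);
    std::memcpy(&bi, &b, sizeof bi);
    return ai == bi;
}
```

Compiling these two functions at -O2 and comparing their assembly is a quick way to check the missed optimization for yourself.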
struct alignas(uint64_t) Point surprisingly still helps in the by-value case, where the arguments are in registers: it optimizes both naiveEqual versions for GCC, but not the bithack XOR/OR (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Clang isn't helped by alignment.
Answer
If you "fix" the alignment, all of the versions give the same assembly language output (with GCC):
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};
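A minimal sketch of the answer's fix, with two static_asserts added here (they are not part of the answer) to confirm that alignas only raises the alignment requirement: the size stays 8 bytes and no padding is introduced, so naiveEqual itself is unchanged.

```cpp
#include <cassert>
#include <cstdint>

// The answer's fix: give Point the alignment of a 64-bit integer.
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};

// alignas changes only the alignment requirement; the size and member
// layout stay the same, so no padding is introduced.
static_assert(sizeof(Point) == 2 * sizeof(std::int32_t), "still exactly two int32s");
static_assert(alignof(Point) == alignof(std::int64_t), "aligned like int64_t");

// Same source as the question's naiveEqual.
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}
```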
As a note, the correct/legal way to do some things (such as type punning) is to use memcpy, so it seems logical for the compiler to have specific optimizations for that function (or to be more aggressive with it).
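As an illustration of memcpy-based type punning (a separate example, not code from the question or the answer; floatBits is a hypothetical name), the following reads the bit pattern of a float, assuming IEEE 754 binary32 floats. Reading the float through a std::uint32_t pointer would violate strict aliasing, whereas the fixed-size memcpy is well-defined and is typically compiled to a plain register move. C++20 also offers std::bit_cast for the same purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Type punning through memcpy: copy the object representation of a float
// into a same-sized integer. Unlike *reinterpret_cast<std::uint32_t*>(&f),
// this does not violate strict aliasing.
std::uint32_t floatBits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With IEEE 754 floats, floatBits(1.0f) returns 0x3F800000.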