Strange gcc6.1 -O2 compiling behaviour


Problem description



I am compiling the same benchmark using the gcc -O2 -march=native flags. Interestingly, though, when I look at the objdump, it actually produces some instructions like vxorpd etc., which I thought should only appear when -ftree-vectorize is enabled (and -O2 should not enable this by default?). If I add the -m32 flag to compile for 32-bit, these packed instructions disappear. Could anyone who has met a similar situation offer an explanation? Thanks.

Solution

XORPD is the classic SSE2 instruction that performs a bitwise logical XOR on two packed double-precision floating-point values.

VXORPD is the VEX-encoded version of that same instruction. Essentially, it is the classic SSE2 XORPD instruction with a VEX prefix. That's what the "V" prefix means in the opcode. It was introduced with AVX (Advanced Vector Extensions), and is supported on any architecture that supports AVX. (There are actually two versions, the VEX.128-encoded version that works on 128-bit XMM registers, and the VEX.256-encoded version that works on the 256-bit YMM registers introduced with AVX.)

All of the legacy SSE and SSE2 instructions can have a VEX prefix added to them, giving them a three-operand form and allowing them to interact and schedule more efficiently with the other new AVX instructions. It also avoids the high cost of transitions between VEX and non-VEX modes. Otherwise, these new encodings retain identical behavior. As such, compilers will typically generate VEX-prefixed versions of these instructions whenever the target architecture supports them. Clearly, in your case, -march=native is specifying an architecture that supports, at a minimum, AVX.

On GCC and Clang, you will actually get these instructions emitted even with optimization turned off (-O0), so you will certainly get them when optimizations are enabled. Neither the -ftree-vectorize switch, nor any of the other vectorization-specific optimization switches need to be on because this doesn't actually have anything to do with vectorizing your code. More precisely, the code flow hasn't changed, just the encoding of the instructions.

You can see this with the simplest code imaginable:

double Foo()
{
   return 0.0;
}

Foo():
        vxorpd  xmm0, xmm0, xmm0
        ret

So that explains why you're seeing VXORPD and its friends when you compile a 64-bit build with the -march=native switch.

That leaves the question of why you don't see it when you throw the -m32 switch (which means to generate code for 32-bit platforms). SSE and AVX instructions are still available when targeting these platforms, and I believe they will be used under certain circumstances, but they cannot be used quite as frequently because of significant differences in the 32-bit ABI. Specifically, the 32-bit ABI requires that floating-point values be returned on the x87 floating point stack. Since that requires the use of the x87 floating point instructions, the optimizer tends to stick with those unless it is heavily vectorizing a section of code. That's the only time it really makes sense to shuffle values from the x87 stack to SIMD registers and back again. Otherwise, that's a performance drain for little to no practical benefit.

You can see this too in action. Look at what changes in the output just by throwing the -m32 switch:

Foo():
        fldz
        ret

FLDZ is the x87 FPU instruction for loading the constant zero at the top of the floating-point stack, where it is ready to be returned to the caller.

Obviously, as you make the code more complicated, you are more likely to change the optimizer's heuristics and persuade it to emit SIMD instructions. You are far more likely still if you enable vectorization-based optimizations.

