Why is icc generating weird assembly for a simple main?


Question

I have a simple program:

int main()
{
    return 2*7;
}

Both GCC and clang with optimizations turned on happily generate a 2-instruction binary, but icc gives bizarre output:

     push      rbp                                           #2.1
     mov       rbp, rsp                                      #2.1
     and       rsp, -128                                     #2.1
     sub       rsp, 128                                      #2.1
     xor       esi, esi                                      #2.1
     mov       edi, 3                                        #2.1
     call      __intel_new_feature_proc_init                 #2.1
     stmxcsr   DWORD PTR [rsp]                               #2.1
     mov       eax, 14                                       #3.12
     or        DWORD PTR [rsp], 32832                        #2.1
     ldmxcsr   DWORD PTR [rsp]                               #2.1
     mov       rsp, rbp                                      #3.12
     pop       rbp                                           #3.12
     ret

Recommended answer

I don't know why ICC chooses to align the stack by 2 cache lines:

and       rsp, -128                                     #2.1
sub       rsp, 128                                      #2.1

That's interesting. The L2 cache has an adjacent-line prefetcher that likes to pull pairs of lines (in a 128-byte-aligned group) into L2. But main's stack frame is not usually heavily used. Maybe important variables are allocated there in some programs. (This also explains setting up rbp: it saves the old RSP so the function can restore it after the AND. gcc also makes stack frames with RBP in functions where it aligns the stack.)
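The effect of the two instructions on the stack pointer can be sketched in C. This is only an illustration of the arithmetic; the function names `align_down_128` and `icc_main_rsp` are made up for this example and do not come from ICC.

```c
#include <stdint.h>

/* Round an address down to a 128-byte boundary, like `and rsp, -128`.
 * In two's complement, -128 has its low 7 bits clear and every other
 * bit set, so the AND simply clears the low 7 bits. */
uintptr_t align_down_128(uintptr_t addr)
{
    return addr & ~(uintptr_t)127;
}

/* Hypothetical helper mirroring the pair of instructions:
 *   and rsp, -128   ; align down to 2 cache lines
 *   sub rsp, 128    ; reserve one aligned 128-byte block */
uintptr_t icc_main_rsp(uintptr_t rsp_on_entry)
{
    return align_down_128(rsp_on_entry) - 128;
}
```

Because the AND discards the low bits, the original RSP cannot be recomputed afterwards, which is why the old value has to be kept in RBP.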

The rest is because main() is special, and ICC enables the equivalent of -ffast-math by default. (This is one of Intel's "dirty" little secrets, and lets it auto-vectorize more floating-point code out of the box.)

This includes adding code to the top of main to set the DAZ / FTZ bits in the MXCSR (the SSE status / control register). See Intel's x86 manuals for more about these bits, but they're really not complicated:

  • DAZ: Denormals Are Zero: as inputs to an SSE/AVX instruction, denormals are treated as zero.
  • FTZ: Flush To Zero: when rounding the result of an SSE/AVX instruction, subnormal results are flushed to zero.
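As a rough sketch of what the stmxcsr / or / ldmxcsr sequence in the question does, the same read-modify-write can be written with the `_mm_getcsr` / `_mm_setcsr` intrinsics from `<xmmintrin.h>`. The macro and function names below are made up for this example, and the `__SSE__` guard assumes a GCC- or clang-style compiler; on a non-SSE target the second function is just a stub.

```c
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#endif

#define MXCSR_DAZ 0x0040u   /* bit 6:  denormals are zero */
#define MXCSR_FTZ 0x8000u   /* bit 15: flush to zero */

/* The immediate ICC ORs into the saved MXCSR: 0x8040 == 32832. */
uint32_t daz_ftz_mask(void)
{
    return MXCSR_DAZ | MXCSR_FTZ;
}

/* Set DAZ and FTZ for the calling thread and return the new MXCSR.
 * Returns 0 when built for a target without SSE (assumption: nothing
 * useful can be done there). */
uint32_t enable_daz_ftz(void)
{
#if defined(__SSE__)
    unsigned int csr = _mm_getcsr();   /* like stmxcsr */
    csr |= daz_ftz_mask();             /* like or ..., 32832 */
    _mm_setcsr(csr);                   /* like ldmxcsr */
    return _mm_getcsr();
#else
    return 0;
#endif
}
```

The constant 32832 in the listing is exactly these two bits, so the `or` leaves the rounding mode and exception masks untouched.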

Related: SSE "denormals are zero" option

(ISO C++ forbids a program from calling back into main(), so compilers are allowed to put run-once code in main itself instead of in CRT startup files. With gcc/clang, specifying -ffast-math at link time links in CRT startup code that sets the MXCSR. But when compiling with gcc/clang, -ffast-math only affects code generation in terms of which optimizations are allowed, i.e. treating FP add/mul as associative when different temporaries mean they really aren't. That is totally unrelated to setting DAZ/FTZ.)
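A small sketch of why FP addition is not really associative, which is the kind of reordering -ffast-math permits. The function names are invented for this example, and the results noted in the comments assume IEEE-754 doubles and a build without -ffast-math.

```c
/* Evaluate a + b + c with the two possible groupings. Without
 * -ffast-math the compiler must keep each grouping as written. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0:
 *   sum_left  gives 1.0, because a + b is exactly 0
 *   sum_right gives 0.0, because b + c rounds back to -1e20
 *   (1.0 is far below the spacing between doubles near 1e20) */
```

A compiler that treats the two groupings as interchangeable can vectorize reductions more freely, at the cost of results that may differ like this.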

Denormal is being used as a synonym for subnormal here: an FP value with the minimum exponent and a significand where the implicit leading bit is 0 instead of 1, i.e. a nonzero value with magnitude smaller than FLT_MIN or DBL_MIN, the smallest representable normalized float/double.

https://en.wikipedia.org/wiki/Denormal_number .
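A minimal sketch of that definition using the standard `fpclassify` macro from `<math.h>`. The helper name `is_subnormal_float` is made up for this example, and it assumes the default FP environment (DAZ/FTZ not set).

```c
#include <float.h>   /* FLT_MIN */
#include <math.h>    /* fpclassify, FP_SUBNORMAL */

/* Nonzero if x is subnormal: nonzero, with magnitude below FLT_MIN. */
int is_subnormal_float(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

For instance, `is_subnormal_float(FLT_MIN / 2.0f)` is true, while `FLT_MIN` itself and `0.0f` are not subnormal.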

Instructions that produce a subnormal result can be much slower: to optimize for latency, the fast path in some hardware assumes normalized results, and takes a microcode assist if the result can't be normalized. Use perf stat -e fp_assist.any to count such events.

From Bruce Dawson's excellent series of FP articles: That’s Not Normal–the Performance of Odd Floats. Also:

  • Why does changing 0.1f to 0 slow down performance by 10x?
  • Avoiding denormal values in C++

Agner Fog has done some testing (see his microarch pdf), and reports for Haswell/Broadwell:

Underflow and subnormals

Subnormal numbers occur when floating point operations are close to underflow. The handling of subnormal numbers is very costly in some cases because the subnormal results are handled by microcode exceptions.

The Haswell and Broadwell have a penalty of approximately 124 clock cycles in all cases where an operation on normal numbers gives a subnormal result. There is a similar penalty for a multiplication between a normal and a subnormal number, regardless of whether the result is normal or subnormal. There is no penalty for adding a normal and a subnormal number, regardless of the result. There is no penalty for overflow, underflow, infinity or not-a-number results.

The penalties for subnormal numbers are avoided if the "flush-to-zero" mode and the "denormals-are-zero" mode are both set in the MXCSR register.

So in some cases, modern Intel CPUs avoid penalties even with subnormals, but
