为什么英特尔的编译器更喜欢NEG + ADD而不是SUB? [英] Why does Intel's compiler prefer NEG+ADD over SUB?
问题描述
在检查各种代码片段的各种编译器的输出时,我注意到英特尔的C编译器(ICC)倾向于 strong 倾向于发出一对NEG
+ ADD
指令,其他编译器将使用一条SUB
指令.
作为一个简单的示例,请考虑以下C代码:
uint64_t Mod3(uint64_t value)
{
return (value % 3);
}
ICC将此转换为以下机器代码(与优化级别无关):
mov rcx, 0xaaaaaaaaaaaaaaab
mov rax, rdi
mul rcx
shr rdx, 1
lea rsi, QWORD PTR [rdx+rdx*2]
neg rsi ; \ equivalent to:
add rdi, rsi ; / sub rdi, rsi
mov rax, rdi
ret
其他编译器(包括MSVC,GCC和Clang)都将生成本质上等效的代码,只是将NEG
+ ADD
序列替换为单个SUB
指令.
就像我说的那样,这不仅是ICC如何编译此特定代码段的怪癖.这是我在对反汇编进行算术运算时反复观察到的一种模式.我通常不会考虑太多,除了ICC被认为是一个非常好的优化编译器 之外,它是由对微处理器有内幕信息的人们开发的.
英特尔在处理器上实现SUB
指令的实现方面是否会有所了解,从而使其更佳地分解为NEG
+ ADD
指令?使用RISC风格的指令解码为更简单的µop是现代微体系结构的著名优化建议,因此SUB
可能在内部分解为单独的NEG
和ADD
µop,实际上更有效为前端解码器使用这些更简单"的指令?现代的CPU 很复杂,因此一切皆有可能.
Agner Fog的综合说明表证实了我的直觉,尽管这实际上是一个悲观.在所有处理器上,SUB
的效率均与ADD
相同,因此,所需的附加NEG
指令仅能减慢速度.
我还通过英特尔自己的体系结构代码运行了这两个序列分析器来分析吞吐量.尽管确切的周期数和端口绑定从一个微体系结构到另一个微体系结构都不同,但是从Nehalem到Broadwell,单个SUB
似乎在各个方面都比较优越.这是该工具为Haswell生成的两个报告:
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.5 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.8 | 1.7 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.2 | | | | 0.3 | 0.4 | | CP | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 0.9 | | | | | | 0.1 | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.3 | | | | 0.4 | 0.2 | | CP | sub rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 7
NEG + ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.1 0.0 | 2.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 2.0 | 2.0 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.9 | | | | 0.1 | 0.1 | | | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 1.0 | | | | | | | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.1 | | | | 0.8 | 0.1 | | CP | neg rax
| 1 | 0.1 | | | | | 0.1 | 0.9 | | CP | add rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 8
据我所知,NEG
+ ADD
增加了代码大小,增加了μop的数量,增加了执行端口的压力,并增加了循环数量,因此导致了与SUB
相比,吞吐量净下降.那么,为什么英特尔的编译器会这样做?
这仅仅是应该报告为缺陷的代码生成器的一些怪癖,还是我在分析中缺少一些优点?
奇怪的是,我有一个简单的答案:因为ICC并不是最佳选择.
编写自己的编译器时,您将开始使用一些非常基本的操作代码集:NOP
,MOV
,ADD
...最多10个操作码.暂时不要使用SUB
,因为它很容易被替换为:ADD NEGgative operand
. NEG
也不是基本的,它可能会被替换为:XOR FFFF...; ADD 1
.
因此,您可以对操作数类型和大小实现相当复杂的基于位的寻址.您可以针对一条机器代码指令(例如ADD
)执行此操作,并计划将其进一步用于大多数其他指令.但是到此时,您的同事无需使用SUB
就可以完成对余数的最佳计算!想象一下-它已经被称为"Optimal_Mod",所以您会错过一些不理想的东西,不是因为您是个坏家伙而讨厌AMD,而是因为您看到了-它已经被称为优化,优化了.
总体来说,英特尔编译器相当不错,但是它的版本历史很长,因此在极少数情况下它的行为可能会很奇怪.我建议您将这一问题通知英特尔,并看看会发生什么.
In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG
+ADD
instructions where other compilers would use a single SUB
instruction.
As a simple example, consider the following C code:
uint64_t Mod3(uint64_t value)
{
return (value % 3);
}
ICC translates this to the following machine code (regardless of optimization level):
mov rcx, 0xaaaaaaaaaaaaaaab
mov rax, rdi
mul rcx
shr rdx, 1
lea rsi, QWORD PTR [rdx+rdx*2]
neg rsi ; \ equivalent to:
add rdi, rsi ; / sub rdi, rsi
mov rax, rdi
ret
Whereas other compilers (including MSVC, GCC, and Clang) will all generate essentially equivalent code, except that the NEG
+ADD
sequence is replaced by a single SUB
instruction.
Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.
Could there be something that Intel knows about the implementation of the SUB
instruction on their processors that makes it more optimal to decompose it into NEG
+ADD
instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB
is broken down internally into individual NEG
and ADD
µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.
Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB
is equally as efficient as ADD
on all processors, so the additional required NEG
instruction just serves to slow things down.
I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB
appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.5 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.8 | 1.7 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.2 | | | | 0.3 | 0.4 | | CP | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 0.9 | | | | | | 0.1 | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.3 | | | | 0.4 | 0.2 | | CP | sub rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 7
NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.1 0.0 | 2.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 2.0 | 2.0 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.9 | | | | 0.1 | 0.1 | | | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 1.0 | | | | | | | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.1 | | | | 0.8 | 0.1 | | CP | neg rax
| 1 | 0.1 | | | | | 0.1 | 0.9 | | CP | add rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 8
So, as far as I can tell, NEG
+ADD
increases the code size, increases the number of µops, increases pressure for execution ports, and increases the number of cycles, thus resulting in a net decrease in throughput compared to SUB
. So why is Intel's compiler doing this?
Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?
Strangely I have a simple answer: Because ICC isn't optimal.
When you write own compiler you get started with some very basic set of operation codes: NOP
, MOV
, ADD
... up to 10 opcodes. You don't use SUB
for a while because it might easily be replaced by: ADD NEGgative operand
. NEG
isn't basic as well, as it might be replaced by: XOR FFFF...; ADD 1
.
So you implement rather complex bit-based addressing of operand types and sizes. You do it for a single machine code instruction (eg. ADD
) and plan to use it further for most other instructions. But by this time your co-worker finishes implementation of optimal calculation of remainder without use of SUB
! Imagine - it's already called "Optimal_Mod" so you miss some inoptimal thing inside not because you're a bad guy and hate AMD but just because you see - it's already called optimal, optimized.
Intel Compiler is pretty good in general, but it has a long version history, so it can behave strange in some rare cases. I suggest you inform Intel about this issue and look what will happen.
这篇关于为什么英特尔的编译器更喜欢NEG + ADD而不是SUB?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!