为什么英特尔的编译器更喜欢NEG + ADD而不是SUB? [英] Why does Intel's compiler prefer NEG+ADD over SUB?

查看:55
本文介绍了为什么英特尔的编译器更喜欢NEG + ADD而不是SUB?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在检查各种代码片段的各种编译器的输出时,我注意到英特尔的C编译器(ICC)倾向于 strong 倾向于发出一对NEG + ADD指令,其他编译器将使用一条SUB指令.

作为一个简单的示例,请考虑以下C代码:

uint64_t Mod3(uint64_t value)
{
    return (value % 3);
}

ICC将此转换为以下机器代码(与优化级别无关):

mov       rcx, 0xaaaaaaaaaaaaaaab
mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ; \  equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret         

其他编译器(包括MSVC,GCC和Clang)都将生成本质上等效的代码,只是将NEG + ADD序列替换为单个SUB指令.

就像我说的那样,这不仅是ICC如何编译此特定代码段的怪癖.这是我在对反汇编进行算术运算时反复观察到的一种模式.我通常不会考虑太多,除了ICC被认为是一个非常好的优化编译器 之外,它是由对微处理器有内幕信息的人们开发的.

英特尔在处理器上实现SUB指令的实现方面是否会有所了解,从而使其更佳地分解为NEG + ADD指令?使用RISC风格的指令解码为更简单的µop是现代微体系结构的著名优化建议,因此SUB可能在内部分解为单独的NEGADD µop,实际上更有效为前端解码器使用这些更简单"的指令?现代的CPU 很复杂,因此一切皆有可能.

Agner Fog的综合说明表证实了我的直觉,尽管这实际上是一个悲观.在所有处理器上,SUB的效率均与ADD相同,因此,所需的附加NEG指令仅能减慢速度.

我还通过英特尔自己的体系结构代码运行了这两个序列分析器来分析吞吐量.尽管确切的周期数和端口绑定从一个微体系结构到另一个微体系结构都不同,但是从Nehalem到Broadwell,单个SUB似乎在各个方面都比较优越.这是该工具为Haswell生成的两个报告:

SUB

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 7

NEG + ADD

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 8

据我所知,NEG + ADD增加了代码大小,增加了μop的数量,增加了执行端口的压力,并增加了循环数量,因此导致了SUB相比,吞吐量净下降.那么,为什么英特尔的编译器会这样做?

这仅仅是应该报告为缺陷的代码生成器的一些怪癖,还是我在分析中缺少一些优点?

解决方案

奇怪的是,我有一个简单的答案:因为ICC并不是最佳选择.

编写自己的编译器时,您将开始使用一些非常基本的操作代码集:NOPMOVADD ...最多10个操作码.暂时不要使用SUB,因为它很容易被替换为:ADD NEGgative operand. NEG也不是基本的,它可能会被替换为:XOR FFFF...; ADD 1.

因此,您可以对操作数类型和大小实现相当复杂的基于位的寻址.您可以针对一条机器代码指令(例如ADD)执行此操作,并计划将其进一步用于大多数其他指令.但是到此时,您的同事无需使用SUB就可以完成对余数的最佳计算!想象一下-它已经被称为"Optimal_Mod",所以您会错过一些不理想的东西,不是因为您是个坏家伙而讨厌AMD,而是因为您看到了-它已经被称为优化,优化了.

总体来说,英特尔编译器相当不错,但是它的版本历史很长,因此在极少数情况下它的行为可能会很奇怪.我建议您将这一问题通知英特尔,并看看会发生什么.

In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG+ADD instructions where other compilers would use a single SUB instruction.

As a simple example, consider the following C code:

uint64_t Mod3(uint64_t value)
{
    return (value % 3);
}

ICC translates this to the following machine code (regardless of optimization level):

mov       rcx, 0xaaaaaaaaaaaaaaab
mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ; \  equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret         

Whereas other compilers (including MSVC, GCC, and Clang) will all generate essentially equivalent code, except that the NEG+ADD sequence is replaced by a single SUB instruction.

Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.

Could there be something that Intel knows about the implementation of the SUB instruction on their processors that makes it more optimal to decompose it into NEG+ADD instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into individual NEG and ADD µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.

Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB is equally as efficient as ADD on all processors, so the additional required NEG instruction just serves to slow things down.

I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:

SUB

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 7

NEG+ADD

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 8

So, as far as I can tell, NEG+ADD increases the code size, increases the number of µops, increases pressure for execution ports, and increases the number of cycles, thus resulting in a net decrease in throughput compared to SUB. So why is Intel's compiler doing this?

Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?

解决方案

Strangely I have a simple answer: Because ICC isn't optimal.

When you write own compiler you get started with some very basic set of operation codes: NOP, MOV, ADD... up to 10 opcodes. You don't use SUB for a while because it might easily be replaced by: ADD NEGgative operand. NEG isn't basic as well, as it might be replaced by: XOR FFFF...; ADD 1.

So you implement rather complex bit-based addressing of operand types and sizes. You do it for a single machine code instruction (eg. ADD) and plan to use it further for most other instructions. But by this time your co-worker finishes implementation of optimal calculation of remainder without use of SUB! Imagine - it's already called "Optimal_Mod" so you miss some inoptimal thing inside not because you're a bad guy and hate AMD but just because you see - it's already called optimal, optimized.

Intel Compiler is pretty good in general, but it has a long version history, so it can behave strange in some rare cases. I suggest you inform Intel about this issue and look what will happen.

这篇关于为什么英特尔的编译器更喜欢NEG + ADD而不是SUB?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆