Performance difference between two seemingly equivalent assembly codes

Problem description

tl;dr: I have two functionally equivalent C codes that I compile with Clang (the fact that it's C code doesn't matter much; only the assembly is interesting I think), and IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for the two codes.

我有以下C代码(暂时忽略#include "iacaMarks.h"IACA_STARTIACA_END):

I have the following C code (ignore #include "iacaMarks.h", IACA_START, IACA_END for now):

ref.c:

#include "iacaMarks.h"
#include <x86intrin.h>

#define AND(a,b)  _mm_and_si128(a,b)
#define OR(a,b)   _mm_or_si128(a,b)
#define XOR(a,b)  _mm_xor_si128(a,b)
#define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

void sbox_ref (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
               __m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
  __m128i r4;

  IACA_START
  r3 = XOR(r3,r0);
  r4 = r1;
  r1 = AND(r1,r3);
  r4 = XOR(r4,r2);
  r1 = XOR(r1,r0);
  r0 = OR(r0,r3);
  r0 = XOR(r0,r4);
  r4 = XOR(r4,r3);
  r3 = XOR(r3,r2);
  r2 = OR(r2,r1);
  r2 = XOR(r2,r4);
  r4 = NOT(r4);
  r4 = OR(r4,r1);
  r1 = XOR(r1,r3);
  r1 = XOR(r1,r4);
  r3 = OR(r3,r0);
  r1 = XOR(r1,r3);
  r4 = XOR(r4,r3);
  *r5 = r1;
  *r6 = r4;
  *r7 = r2;
  *r8 = r0;
  IACA_END

}

I was wondering if I could optimize it by manually rescheduling a few instructions (I am well aware that the C compiler should produce an efficient scheduling, but my experiments have shown that it's not always the case). At some point, I tried the following code (it's the same as above, except that no temporary variables are used to store the results of the XORs that are later assigned to *r5 and *r6):

resched.c:

#include "iacaMarks.h"
#include <x86intrin.h>

#define AND(a,b)  _mm_and_si128(a,b)
#define OR(a,b)   _mm_or_si128(a,b)
#define XOR(a,b)  _mm_xor_si128(a,b)
#define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

void sbox_resched (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
                   __m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
  __m128i r4;

  IACA_START
  r3 = XOR(r3,r0);
  r4 = r1;
  r1 = AND(r1,r3);
  r4 = XOR(r4,r2);
  r1 = XOR(r1,r0);
  r0 = OR(r0,r3);
  r0 = XOR(r0,r4);
  r4 = XOR(r4,r3);
  r3 = XOR(r3,r2);
  r2 = OR(r2,r1);
  r2 = XOR(r2,r4);
  r4 = NOT(r4);
  r4 = OR(r4,r1);
  r1 = XOR(r1,r3);
  r1 = XOR(r1,r4);
  r3 = OR(r3,r0);
  *r7 = r2;
  *r8 = r0;
  *r5 = XOR(r1,r3); // These two lines are different
  *r6 = XOR(r4,r3); // (no more temporary variables)
  IACA_END
}

I'm compiling these codes using Clang 5.0.0 targeting my i5-6500 (Skylake), with the flags -O3 -march=native (I'm omitting the assembly code produced, as it can be found in the IACA outputs below, but if you'd prefer to have it directly here, ask me and I'll add it). I benchmarked those two codes and didn't find any performance difference between them. Out of curiosity, I ran IACA on them, and I was surprised to see that it said the first version should take 6 cycles to run, and the second version 7 cycles. Here are the outputs produced by IACA:

For the first version:

dada@dada-ubuntu ~/perf % clang -O3 -march=native -c ref.c && ./iaca -arch SKL ref.o        
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File -  ref_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.00 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  6.0     0.0  |  6.0  |  1.3     0.0  |  1.4     0.0  |  4.0  |  6.0  |  0.0  |  1.4  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm4, xmm3, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpand xmm5, xmm4, xmm1
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm2, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm5, xmm5, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpor xmm0, xmm3, xmm0
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm0, xmm0, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm1, xmm4, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm3, xmm4, xmm2
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm2, xmm5, xmm2
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm2, xmm2, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpcmpeqd xmm4, xmm4, xmm4
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm1, xmm4
|   1      | 1.0         |      |             |             |      |      |      |      | vpor xmm1, xmm5, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm4, xmm5, xmm3
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm3, xmm0, xmm3
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm4, xmm4, xmm3
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm4, xmm4, xmm1
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm1, xmm3
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdi], xmm4
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rsi], xmm1
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rcx], xmm0
Total Num Of Uops: 26

For the second version:

dada@dada-ubuntu ~/perf % clang -O3 -march=native -c resched.c && ./iaca -arch SKL resched.o
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File -  resched_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 7.00 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  6.0     0.0  |  6.0  |  1.3     0.0  |  1.4     0.0  |  4.0  |  6.0  |  0.0  |  1.3  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm4, xmm3, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpand xmm5, xmm4, xmm1
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm2, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm5, xmm5, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpor xmm0, xmm3, xmm0
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm0, xmm0, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm1, xmm4, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm3, xmm4, xmm2
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm2, xmm5, xmm2
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm2, xmm2, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpcmpeqd xmm4, xmm4, xmm4
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm1, xmm4
|   1      | 1.0         |      |             |             |      |      |      |      | vpor xmm1, xmm5, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm4, xmm5, xmm3
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm3, xmm0, xmm3
|   2^     |             |      | 0.3         | 0.4         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.4  | vmovdqa xmmword ptr [rcx], xmm0
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm0, xmm4, xmm3
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm0, xmm0, xmm1
|   2^     |             |      | 0.4         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdi], xmm0
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm0, xmm1, xmm3
|   2^     |             |      | 0.3         | 0.4         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rsi], xmm0
Total Num Of Uops: 26
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.

As you can see, on the second version, IACA says that the bottleneck is the backend and that "Backend allocation was stalled due to unavailable allocation resources".

Both assembly codes contain the same instructions, and the only differences are the scheduling of the last 7 instructions, as well as the registers they use.

The only thing I can think of that would explain why the second code is slower is the fact that it writes xmm0 twice in the last 4 instructions, thus introducing a dependency. But since those writes are independent, I would expect the CPU to use different physical registers for them. However, I can't really prove that theory. Also, if reusing xmm0 like that were an issue, I would expect Clang to use a different register for one of the instructions (especially since register pressure is low here).

My question: is the second code supposed to be slower (based on the assembly code), and why?

Edit: IACA traces:

First version: https://pastebin.com/qGXHVW6a
Second version: https://pastebin.com/dbBNWsc2

Note: the C code is an implementation of the first S-box of the Serpent cipher, as computed by Osvik here.

Answer

Figuring out why the second code is backend-bound requires some amount of manual analysis, because the output emitted by IACA is quite raw, although extremely rich in information. Note that the traces emitted by IACA are particularly useful for analyzing loops. They can also be useful for understanding how straight-line sequences of instructions get executed, but the emitted traces need to be interpreted differently. Throughout the rest of this answer, I will present my analysis for the loop scenario, which is the more difficult one.

The fact that you emitted the traces without putting the code in a loop affects the following things:

  • The compiler couldn't inline and optimize away the stores to the output operands. They wouldn't appear at all in a real loop, or when chaining this to a different S-box.
  • The data dependencies from outputs to inputs happen by coincidence, because the compiler used xmm0..3 to prepare the data to be stored, not as a consequence of choosing which output to feed back into which input of the same S-box.
  • The vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
  • There would be a dec/jnz or equivalent loop overhead (which can be macro-fused into a single uop for port 6).

But you've asked IACA to analyze this exact block of asm as if it were run in a loop. So to explain the results, that's how we'll think of it (even though it's not what you'd get from a C compiler if you used this function in a loop).
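For concreteness, here is a sketch of my own (not from the question) of what such a loop might look like in C, with the outputs fed back into the inputs; the feedback choice and the fold at the end are arbitrary, and the S-box body is the one from ref.c above:

```c
#include <x86intrin.h>

#define AND(a,b)  _mm_and_si128(a,b)
#define OR(a,b)   _mm_or_si128(a,b)
#define XOR(a,b)  _mm_xor_si128(a,b)
#define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

/* Same computation as sbox_ref above, made inlinable. */
static inline void sbox(__m128i r0, __m128i r1, __m128i r2, __m128i r3,
                        __m128i *r5, __m128i *r6, __m128i *r7, __m128i *r8) {
  __m128i r4;
  r3 = XOR(r3,r0); r4 = r1;         r1 = AND(r1,r3); r4 = XOR(r4,r2);
  r1 = XOR(r1,r0); r0 = OR(r0,r3);  r0 = XOR(r0,r4); r4 = XOR(r4,r3);
  r3 = XOR(r3,r2); r2 = OR(r2,r1);  r2 = XOR(r2,r4); r4 = NOT(r4);
  r4 = OR(r4,r1);  r1 = XOR(r1,r3); r1 = XOR(r1,r4); r3 = OR(r3,r0);
  r1 = XOR(r1,r3); r4 = XOR(r4,r3);
  *r5 = r1; *r6 = r4; *r7 = r2; *r8 = r0;
}

/* Applies the S-box n times, chaining outputs back into inputs. After
   inlining, the stores to t5..t8 disappear, the all-ones vector can be
   hoisted, and only a loop counter plus the vector ALU work remains. */
__m128i sbox_chain(__m128i s0, __m128i s1, __m128i s2, __m128i s3, long n) {
  __m128i t5, t6, t7, t8;
  for (long i = 0; i < n; i++) {
    sbox(s0, s1, s2, s3, &t5, &t6, &t7, &t8);
    s0 = t5; s1 = t6; s2 = t7; s3 = t8;  /* arbitrary feedback choice */
  }
  return XOR(XOR(s0, s1), XOR(s2, s3)); /* fold so the result is used */
}
```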

A jmp or dec/jnz at the bottom to make this a loop is not a problem in this case: It will always get executed on port 6, which is not used by any vector instruction. This means that the jump instruction will not contend on port 6 and will not consume scheduler uop bandwidth that would otherwise have been used by other instructions. However, this can impact the resource allocator bandwidth in the issue/rename stage (which is no more than 4 fused domain uops per cycle), but this is not important in this particular case as I will discuss.

Let's first examine the port pressure ASCII figure:

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm4, xmm3, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpand xmm5, xmm4, xmm1
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm2, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm5, xmm5, xmm0
|   1      |             | 1.0  |             |             |      |      |      |      | vpor xmm0, xmm3, xmm0
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm0, xmm0, xmm1
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm1, xmm4, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm3, xmm4, xmm2
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm2, xmm5, xmm2
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm2, xmm2, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpcmpeqd xmm4, xmm4, xmm4
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm1, xmm1, xmm4
|   1      | 1.0         |      |             |             |      |      |      |      | vpor xmm1, xmm5, xmm1
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm4, xmm5, xmm3
|   1      |             |      |             |             |      | 1.0  |      |      | vpor xmm3, xmm0, xmm3
|   2^     |             |      | 0.3         | 0.4         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
|   2^     |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.4  | vmovdqa xmmword ptr [rcx], xmm0
|   1      | 1.0         |      |             |             |      |      |      |      | vpxor xmm0, xmm4, xmm3
|   1      |             | 1.0  |             |             |      |      |      |      | vpxor xmm0, xmm0, xmm1
|   2^     |             |      | 0.4         | 0.3         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rdi], xmm0
|   1      |             |      |             |             |      | 1.0  |      |      | vpxor xmm0, xmm1, xmm3
|   2^     |             |      | 0.3         | 0.4         | 1.0  |      |      | 0.3  | vmovdqa xmmword ptr [rsi], xmm0

The total number of fused-domain uops is 22. Six uops have been assigned to each of ports 0, 1, and 5. The other 4 uops each consist of an STA and an STD uop; the STD uop requires port 4. This assignment is reasonable. If we ignore all data dependencies, it appears the scheduler should be able to dispatch at least 3 fused-domain uops every cycle. However, there can be serious contention at port 4, which may lead to filling up the reservation station. According to IACA, that is not the bottleneck in this code. Note that if the scheduler could somehow achieve a throughput equal to the maximum throughput of the allocator, then the code could only be frontend-bound. Obviously, that is not the case here.

The next step is to carefully examine the IACA trace. I made the following data flow graph based on the trace, which is easier to analyze. The horizontal yellow lines divide the graph according to which uops get allocated in the same cycle. Note that IACA always assumes perfect branch prediction. Also note that this division is about 99% accurate, but not 100%. This is not important and you can just consider it 100% accurate. The nodes represent fused uops and the arrows represent data dependences (an arrow points to the destination uop). Nodes are colored according to the loop iteration they belong to. The sources of the arrows at the top of the graph are omitted for clarity. The green boxes on the right contain the cycle number at which allocation is performed for the corresponding uops. So the previous cycle is X, and the current cycle is X + 1, whatever X is. The stop signs indicate that the associated uop suffers contention at one of the ports. All the red stop signs represent contention on port 1. There is only one other stop sign, of a different color, which represents contention on port 5. There are other cases of contention, but I'll omit them for clarity. Arrows come in two colors: blue and red. The red ones are the critical ones. Note that it takes 11 cycles to allocate 2 iterations' worth of instructions, and then the allocation pattern repeats. Keep in mind that Skylake has 97 RS entries.

The location of a node within each division (the "local" location) has a meaning. If two nodes are on the same row and if all of their operands are available, then it means that they can be dispatched in the same cycle. Otherwise, if the nodes are not on the same row, then they may not be dispatched in the same cycle. This only applies to dynamic uops that have been allocated together as a group and not to dynamic uops allocated as part of different groups even if they happen to be in the same division in the graph.

I'll use the notation (it, in) to identify a specific fused uop, where it is a zero-based loop iteration number and in is a zero-based uop number. The most important part of the IACA trace is the one that shows the pipeline stages for (11, 5):

11| 5|vpxor xmm0, xmm0, xmm1                            :          |         |         |         |         |         |         |         |         |         |         |         |         |         |         
11| 5|    TYPE_OP (1 uops)                              :          |         |         |         |         |         |_A--------------------dw----R-------p  |         |         |         |         |         

This tells us that the allocation bandwidth is underutilized at this point due to unavailable resources (in this case, an entry in the reservation station). This means that the scheduler was not able to sustain a high enough throughput of unfused uops to keep up with the front end's 4 fused uops per cycle. Since IACA has already told us that the code is backend-bound, the reason for this underutilization is obviously not some long dependency chain or contention at specific execution units, but rather something more complicated. So we need to do more work to figure out what's going on. We have to analyze the trace past (11, 5).

The uops 1, 4, 7, 10, 13, 18 of every iteration are all assigned to port 1. What happens during a period of 11 cycles? There are a total of 12 uops that require port 1, so it's impossible to dispatch all of them in 11 cycles because it will take at least 12 cycles. Unfortunately, data dependencies within the uops that require the same port and across uops that require other ports exacerbate the problem significantly. Consider the following pipeline flow during an 11-cycle period:

  • At cycle 0: (0, 0) and (0, 1) get allocated (along with other uops that we don't care about right now). (0, 1) is data-dependent on (0, 0).
  • 1: (0, 4) and (0, 7) get allocated. Assuming that no older ready uop is assigned to port 0 and that the operands of (0, 0) are ready, (0, 0) is dispatched to port 0. Port 1 potentially remains idle because (0, 1) is not ready yet.
  • 2: The result of (0, 0) is available through the bypass network. At this point, (0, 1) can and will be dispatched. However, even if (0, 4) or (0, 7) were ready, neither is the oldest uop assigned to port 1, so both get blocked. (0, 10) gets allocated.
  • 3: (0, 4) is dispatched to port 1. (0, 7) and (0, 10) both get blocked even if their operands are ready. (0, 13) gets allocated.
  • 4: (0, 7) is dispatched to port 1. (0, 10) gets blocked. (0, 13) has to wait for (0, 7). (0, 18) gets allocated.
  • 5: (0, 10) is dispatched to port 1. (0, 13) gets blocked. (0, 18) has to wait for (0, 17), which depends on (0, 13). (1, 0) and (1, 1) get allocated.
  • 6: (0, 13) is dispatched to port 1. (0, 18) has to wait for (0, 17), which depends on (0, 13). (1, 1) has to wait for (1, 0). (1, 0) cannot be dispatched because the distance between (1, 0) and (0, 7) is 3 uops, one of which may suffer a port conflict. (1, 4) gets allocated.
  • 7: Nothing gets dispatched to port 1 because (0, 18), (1, 1), and (1, 4) are not ready. (1, 7) gets allocated.
  • 8: Nothing gets dispatched to port 1 because (0, 18), (1, 1), (1, 4), and (1, 7) are not ready. (1, 10) and (1, 13) get allocated.
  • 9: (0, 18) is dispatched to port 1. (1, 10) and (1, 4) are ready but get blocked due to port contention. (1, 1), (1, 7), and (1, 13) are not ready.
  • 10: (1, 1) is dispatched to port 1. (1, 4), (1, 7), and (1, 10) are ready but get blocked due to port contention. (1, 13) is not ready. (1, 18) gets allocated.

Well, ideally, we'd like 11 of the 12 uops to be dispatched to port 1 over the 11 cycles. But this analysis shows that the situation is far from ideal. Port 1 is idle for 4 out of the 11 cycles! If we assume that some (X, 18) from a previous iteration gets dispatched at cycle 0, then port 1 would still be idle for 3 cycles, which is a lot of waste, considering that we have 12 uops that require it every 11 cycles. Out of the 12 uops, only up to 8 got dispatched. How bad can the situation get? We can continue analyzing the trace and record the number of p1-bound uops that are either ready to be dispatched but blocked due to a conflict, or not ready due to data dependencies. I was able to determine that the number of p1-bound uops stalled due to port conflicts is never larger than 3. However, the number of p1-bound uops stalled due to data dependencies gradually increases with time overall. I did not see any pattern in the way it increases, so I decided to use linear regression on the first 24 cycles of the trace to predict at what point there would be 97 such uops. The following figure shows that.

The x-axis represents the zero-based cycles increasing from left to right. Note that the number of uops is zero for the first 4 cycles. The y-axis represents the number of such uops at the corresponding cycle. The linear regression equation is:

y = 0.3624x - 0.6925.

By setting y to 97 we get:

x = (97 + 0.6925) / 0.3624 = 269.57

That is, at about cycle 269, we expect the RS to contain 97 uops that are all p1-bound and waiting for their operands to become ready. It is at that point that the RS is full. However, there can be other uops waiting in the RS for other reasons. So we expect the allocator to underutilize its bandwidth at or before cycle 269. By looking at the IACA trace for instruction (11, 5), we can see that the situation actually happens at cycle 61, which is much earlier than 269. This means that either my predictor is very optimistic, or that the counts of uops bound to other ports exhibit similar behavior. My gut tells me it's the latter. But that is good enough to understand why IACA said that the code is backend-bound. You can perform a similar analysis on the first code to understand why it's frontend-bound. I guess I'll just leave that as an exercise for the reader.

This manual analysis can be followed in case IACA does not support a particular piece of code or when a tool like IACA does not exist for a particular microarchitecture. The linear regression model enables us to estimate after how many iterations the allocator underutilizes its bandwidth. For example, in this case, cycle 269 corresponds to iteration 269/11/2 = 269/22 ≈ 12. So as long as the maximum number of iterations is not much larger than 12, the backend performance of the loop would be less of an issue.

There is a related post by @Bee: How are x86 uops scheduled, exactly?.

I may post the details of what happens during the first 24 cycles later.

Side note: There are two errors in Wikichip's article on Skylake. First, Broadwell's scheduler has 60 entries, not 64. Second, the allocator's throughput is at most 4 fused uops per cycle.
