Is leaq THAT slow, or is there another reason the smaller assembly listing is slower than the longer one?

Question

I do not know any real assembly, but I can read GCC -S output to evaluate the actual cost of a given piece of C code.

This question is not so much about profiling and benchmarks as it is educational. I need someone to explain to me why snippet [1] is not faster than the second one.

I used to think along the lines of: "yeah, some operations like MUL are pretty expensive, but if one assembly listing is X times bigger than another, it should be slower".

That held true until I met these two:

unsigned char bytes[4] = {0, 0, 0, 5};

// 1
int32_t val = *((int32_t*)bytes);      
/* produces:
        leaq    -16(%rbp), %rax
        movl    (%rax), %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
*/

// 2   
val = bytes[3] |                               
      (bytes[2] << 8) |                        
      (bytes[1] << 16) |
      (bytes[0] << 24);
/* produces: 
        movzbl  -13(%rbp), %eax
        movzbl  %al, %eax
        movzbl  -14(%rbp), %edx
        movzbl  %dl, %edx
        sall    $8, %edx
        orl     %eax, %edx
        movzbl  -15(%rbp), %eax
        movzbl  %al, %eax
        sall    $16, %eax
        orl     %eax, %edx
        movzbl  -16(%rbp), %eax
        movzbl  %al, %eax
        sall    $24, %eax
        orl     %edx, %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
*/

And benchmarks show that the second one is 5-10% faster. What is going on here?

The only significant difference, and the only "reason" I can imagine, is that LEAQ is something very slow. The last 2 lines are identical, so maybe the MOV price is so high that 1 extra MOV is worse than tons of other instructions.

Here is what I used to measure execution times:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

#define REPETITIONS 32
#define ITERATIONS 90000

#define CODE1                   \
  for (int i = 0; i < ITERATIONS; ++i) {    \
    val = *((int32_t*)bytes);           \
  }

#define CODE2                   \
  for (int i = 0; i < ITERATIONS; ++i) {    \
    val = bytes[3] |                \
      (bytes[2] << 8) |             \
      (bytes[1] << 16) |            \
      (bytes[0] << 24);             \
  }

int main(void) {
  clock_t minTs = 999999999, ts;
  unsigned char bytes[4] = {0, 0, 0, 5};    
  int32_t val;                  

  for (int i = 0; i < REPETITIONS; ++i) {
    ts = clock();

    CODE1; // Or CODE2

    ts = clock() - ts;
    if (ts < minTs) minTs = ts;
  }

  printf("ts: %ld\n", minTs);

  return val;
}



Update: as it turns out, the results are hardware specific. While snippet [1] is slower on my laptop (x64 i5-4260U), it is faster on my PC (though only by a very small fraction, like 5%).

Answer

What you should have done

If you're curious about how your code compiles to asm, put your code in a function where it can't get optimized away. On this godbolt link you can see how gcc 5.1 and newer (at -O3 -march=native -mtune=native) sees through the ORing-together of bytes and uses movbe (move big-endian) to byte-swap on the fly while loading. icc, clang, and older gcc emit instructions that load individual bytes and shift/OR them into place.
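The godbolt link itself isn't preserved in this copy, but a minimal sketch of the kind of functions being compared could look like this (function names are illustrative, not the exact godbolt code):

#include <stdint.h>

// As in the question: a single 32-bit load through a cast.
// (Strictly speaking this violates strict aliasing; it's kept here
// because it's exactly what the question benchmarks.)
int32_t load_native(const unsigned char *bytes) {
    return *(const int32_t *)bytes;
}

// OR the bytes together big-endian style. Per the answer above,
// gcc >= 5.1 at -O3 -march=native compiles this down to a single
// movbe load (or a mov + bswap where movbe isn't available).
int32_t load_be(const unsigned char *bytes) {
    return (int32_t)((uint32_t)bytes[0] << 24
                   | (uint32_t)bytes[1] << 16
                   | (uint32_t)bytes[2] << 8
                   | (uint32_t)bytes[3]);
}

// The same combining with the byte order reversed: a little-endian
// load, which would ideally be a single plain 32-bit mov, though as
// noted below most compilers don't spot it.
int32_t load_le(const unsigned char *bytes) {
    return (int32_t)((uint32_t)bytes[3] << 24
                   | (uint32_t)bytes[2] << 16
                   | (uint32_t)bytes[1] << 8
                   | (uint32_t)bytes[0]);
}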

I was disappointed that compilers didn't see through the ORing of bytes even when I reversed the order to do a little-endian load instead of a big-endian load (see the native, big-endian, and little-endian functions on godbolt). Even changing the types to uint32_t and uint8_t didn't help for compilers other than gcc >= 5.1.

Obviously with optimization on, compilers throw away loops that just set an unused variable. They just call clock twice, call printf, and mov-immediate the answer into eax. If you want to actually benchmark something, compile the code under test separately from the function that calls it with constant data. That way you can have simple code that takes its work as function arguments, and it can't get inlined into a caller that passes it constant data. A sketch of that setup follows.
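A minimal sketch of the separate-compilation approach (the file names, the combine() function, and the iteration count are mine, not from the original answer):

/* combine.c: the code under test, in its own translation unit so the
 * caller's constant data can't be propagated into it. */
#include <stdint.h>

int32_t combine(const unsigned char *bytes) {
    return (int32_t)((uint32_t)bytes[0] << 24
                   | (uint32_t)bytes[1] << 16
                   | (uint32_t)bytes[2] << 8
                   | (uint32_t)bytes[3]);
}

/* bench.c: drives the loop. Without LTO the compiler can't inline
 * combine() or prove it pure, so it must really call it every
 * iteration, and with optimization on both i and val stay in
 * registers. */
#include <stdio.h>
#include <stdint.h>

int32_t combine(const unsigned char *bytes);

int main(void) {
    unsigned char bytes[4] = {0, 0, 0, 5};
    int32_t val = 0;
    for (int i = 0; i < 9000000; ++i)
        val = combine(bytes);
    printf("%d\n", (int)val);
    return 0;
}

Compile the two files separately (e.g. gcc -O3 -c combine.c bench.c, then link the objects), and avoid -flto, which would defeat the separation.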

Also, gcc treats main as a "cold" function and doesn't optimize it quite as heavily as normal functions. In most programs, main doesn't contain the inner loop.

Obviously the code is horrible at -O0: it stores to memory, and even increments the loop counter in memory. It's still somewhat interesting to figure out why it runs even slower than I'd expect, with CODE1 at under one insn per clock.

You didn't show the whole loop for either piece of code. Probably removing the loop body would still leave a slow loop. I think the loop itself is the problem, and it is so inefficient that the CPU has time to do all the extra instructions in CODE2 without any visible slowdown.

TL;DR: both loops are bottlenecked by the add $1, -0x24(%rbp) that increments the loop counter in memory. The 6-cycle latency of that loop-carried dependency chain explains why both loops are bottlenecked to the same speed.

I don't know exactly why the extra instructions in CODE2 somehow help get closer to the theoretical max of one iteration per 6 cycles, but this isn't a bottleneck that should ever come up in code anyone would write on purpose. Keep your loop counters in registers, and don't put a read-modify-write instruction on the same address into a loop-carried dependency chain. (Incrementing memory at a different address each iteration is fine, e.g. for CountingSort.) See the sketch below.
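To make the anti-pattern concrete, here is a small sketch of my own (not from the original answer) that forces a memory read-modify-write into the loop-carried chain even with optimization enabled:

#include <stdio.h>

int main(void) {
    // volatile keeps the counter in memory: every iteration re-loads,
    // adds, and re-stores it, so each increment waits on store
    // forwarding from the previous one. That is roughly the same
    // ~6-cycle-per-iteration chain that gcc -O0 creates for an
    // ordinary loop counter.
    volatile int counter = 0;
    while (counter < 90000000)
        counter = counter + 1;  // load + add + store every time
    printf("%d\n", counter);
    return 0;
}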

See godbolt for the changes I made to the code (with ITERATIONS bumped up by a factor of 100, so runtime dominates the noise of startup overhead). That link has optimization enabled, for the benefit of the first 3 functions.

godbolt doesn't have a C mode, only C++, and I got a less-bad loop from gcc 4.9.2 locally than godbolt shows. (g++ implements a for(){} loop exactly as written, with a cmp/jcc at the top and a jmp at the bottom, while gcc even at -O0 uses a do{} while(count++ < ITERATIONS); structure, with just a cmp/jle at the bottom.)
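In C terms, that rotated do{}while shape looks roughly like this (a structural sketch, not gcc's actual output):

#define ITERATIONS 90000

void loop_shape(void) {
    // What gcc emits even at -O0: one unconditional jump into the
    // test, then a single cmp/jle at the bottom of the loop. g++, by
    // contrast, keeps the test at the top plus a jmp at the bottom.
    int i = 0;
    goto test;
loop:
    /* ... loop body ... */
    ++i;
test:
    if (i < ITERATIONS) goto loop;
}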

  I do not know any real assembly, but I can read GCC -S output to
  evaluate the actual cost of a given piece of C code.

  Well, I used to think like: "yeah, some operations like MUL are pretty
  expensive, but if one assembly listing is X times bigger than another,
  it should be slower".

The first thing to look for is throughput and latency bottlenecks. On Intel, that means a throughput limit of 4 uops per clock, or less if a long dependency chain is the limit. Then there are per-execution-port bottlenecks, e.g. two memory ops per clock, with at most one of them being a store, and at most one mul per clock, because only one of the 3 ALU ports has an integer multiplier.

See Agner Fog's site for optimization guides, microarchitecture docs, and tables of instruction latencies / throughputs / uops / ports they can run on.

Your loops are badly bottlenecked by keeping the loop counter in memory. On SandyBridge (my system) and Haswell (yours), Agner Fog's table puts the latency of add with a memory destination at 6 clocks, so there's no way the loop can run faster than one iteration per 6 clocks. With 6 instructions in the loop, that's 1 insn per cycle.

In practice I'm getting less throughput than that. Maybe the store that is part of add's read-modify-write operation is sometimes delayed by the other loads/stores in the loop. IDK why CODE2 is slightly faster; that is strange. Maybe it orders things differently, so the loop-carried-dependency add is delayed less often.

The loop body using lea and a 32-bit load is obviously faster. IDK why you think it's the lea that's slow.

It's not an alignment / uop-cache issue. The loop should stream from the loop buffer either way, even if there were more than 18 uops in a 32B block of code (which would mean it couldn't go in the uop cache). Frontend bottlenecks (other than branch mispredicts, which we don't have) can't be an issue when our insns per clock are so low. The frontend can easily keep a big supply of uops queued up for the scheduler to dispatch.

From perf report, profiling the clock cycles charged to each instruction, here is CODE1's inner loop. The counts aren't cycle-accurate. We're probably seeing the CPU stuck on the instructions right after the add $1, mem, which I'm sure is the bottlenecking loop-carried dependency. It needs to forward the store to the load on the next iteration, which still takes 6 clocks.

   ###### CODE1 inner loop, profiling on cycles
 13.97 │400680:   lea    -0x10(%rbp),%rax
       │400684:   mov    (%rax),%eax
       │400686:   mov    %eax,-0x2c(%rbp)
       │400689:   addl   $0x1,-0x24(%rbp)
 13.84 │40068d:   cmpl   $0x89543f,-0x24(%rbp)
 72.19 │400694: ↑ jle    400680 <code1+0x4a>
       ## end of the loop
        400696:   callq  4004e0 <clock@plt>
        40069b:   sub    -0x18(%rbp),%rax

       #CODE2
 15.08 │400738:   movzbl -0xd(%rbp),%eax
  0.88 │40073c:   movzbl %al,%eax
  0.05 │40073f:   movzbl -0xe(%rbp),%edx
       │400743:   movzbl %dl,%edx
 13.91 │400746:   shl    $0x8,%edx
  0.70 │400749:   or     %eax,%edx
  0.05 │40074b:   movzbl -0xf(%rbp),%eax
       │40074f:   movzbl %al,%eax
 12.70 │400752:   shl    $0x10,%eax
  0.60 │400755:   or     %eax,%edx
  0.09 │400757:   movzbl -0x10(%rbp),%eax
  0.05 │40075b:   movzbl %al,%eax
 13.03 │40075e:   shl    $0x18,%eax
  0.70 │400761:   or     %edx,%eax
  0.14 │400763:   mov    %eax,-0x2c(%rbp)
  0.09 │400766:   addl   $0x1,-0x24(%rbp)
 13.63 │40076a:   cmpl   $0x89543f,-0x24(%rbp)
 28.29 │400771: ↑ jle    400738 <code2+0x4a>
     ## end of the loop
        400773: → callq  4004e0 <clock@plt>
        400778:   sub    -0x18(%rbp),%rax 

Wow, that's pretty hilarious: gcc does a redundant movzbl %al, %eax after already loading %eax from an 8-bit memory location with movzbl.

So, in 6 clocks per iteration, can the CPU handle all that busy-work of loading and combining bytes? Yes:

  • 4x movzx reg, mem: 4 load-port uops (p2/p3)
  • 4x movzx reg, reg: 4 uops for any ALU port (p015)
  • 3x shl reg, imm: 3 uops for ALU ports p0/p5
  • 3x or reg, reg: 3 uops for any ALU port (p015)
  • 1x mov mem, reg: 1 fused-domain uop: 1 store-data (p4), 1 store-address (p23)
  • 1x add mem, imm: 2 fused-domain uops. Unfused: 1 ALU uop (p015), 1 load uop (p23), 1 store-data (p4), 1 store-address (p23)
  • 1x cmp mem, imm: 1 uop for p015, 1 for p23
  • 1x jle: 1 uop for p5 (it can't macro-fuse with the cmp, because of the imm and mem operands)

Total fused-domain uops: 4+4+3+3+1+2+1+1 = 19. That fits in the 28-uop loop stream buffer, avoiding any possibility of uop-cache bottlenecks, and can issue in 5 clocks (4 per cycle, with the last cycle issuing only 3).

Load uops: 4 + 1 + 1 = 6. Store uops: 2.

ALU uops: 4+3+3+1+1+1 = 13. SnB's 3 ALU uop ports can handle that in 4.33 clocks. Most of the uops can run on any port, so no single port is a bottleneck. (Haswell has a 4th ALU port, p6, so it has an even easier time, but ALU uops aren't the bottleneck anyway.)

The latency of the loop body doesn't matter, because the next iteration doesn't read any of its results. Each iteration reads some data and stores it, independently of what the previous iteration did. Many loops are like this. It's usual for such loops to load from and store to a different address each time, but the CPU just does what it's told.

Anyway, even if the dependency chain within each iteration took more than 6 clocks, work from multiple iterations can be in flight. Nothing in one iteration has to wait for the previous one, except the loop-counter increment with a memory destination.

So all that work in the CODE2 loop is not a bottleneck at all.

For SnB/HSW, add-immediate with a memory destination is 2 uops, while inc on a memory destination is 3, according to Agner Fog's table, which is strange. I wonder if that's an error, or if Intel CPUs really are slower when using inc on a memory destination instead of add $1.

Test timings (from gcc 4.9.2). I don't see anything obvious that would explain why CODE2 gets closer to the theoretical max of one iteration per 6 clocks. My only guess is that CODE1 is confused by the call right after the jle, but CODE2 isn't? Maybe a perf record on uops...

Sandybridge i5 (2500k):

## CODE1 ##
 Performance counter stats for './a.out 1' (4 runs):

        589.117019      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.30% )
     2,068,118,847      cycles                    #    3.511 GHz                      ( +-  0.48% )
     1,729,505,746      instructions              #    0.84  insns per cycle        
                                                  #    0.86  stalled cycles per insn  ( +-  0.01% )
     2,018,533,639      uops_issued_any           # 3426.371 M/sec                    ( +-  0.02% )
     5,648,003,370      uops_dispatched_thread    # 9587.235 M/sec                    ( +-  2.51% )
     3,170,497,726      uops_retired_all          # 5381.779 M/sec                    ( +-  0.01% )
     2,018,126,488      uops_retired_retire_slots # 3425.680 M/sec                    ( +-  0.01% )
     1,491,267,343      stalled-cycles-frontend   #   72.11% frontend cycles idle     ( +-  0.66% )
        27,192,780      stalled-cycles-backend    #    1.31% backend  cycles idle     ( +- 68.75% )

       0.589651097 seconds time elapsed                                          ( +-  1.32% )

It's very unusual to see uops_dispatched_thread not matching uops_retired_all. They're usually both the same, and equal to the number of unfused-domain uops for the instructions in the loop. Fused-domain uops_issued_any and uops_retired_retire_slots are usually both equal, and they are in this case. Maybe memory-destination ALU ops are counted differently in dispatched vs. retired_all? (Micro-fusion.) I think my previous testing only looked at micro-fusion of loads: http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695

I don't think it's issuing uops that turn out not to be needed. (It's not a branch-mispredict issue; I checked, and both versions have 0.00% branch mispredicts, only ~10k for 288M branches.)

## CODE2 ##
peter@tesla:~/src/SO$ ocperf.py stat -r4 -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
perf stat -r4 -e task-clock,cycles,instructions,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_dispatched_thread/,cpu/event=0xc2,umask=0x1,name=uops_retired_all/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,stalled-cycles-frontend,stalled-cycles-backend ./a.out 2
CODE2: ts: 16499
CODE2: ts: 16535
CODE2: ts: 16517
CODE2: ts: 16529

 Performance counter stats for './a.out 2' (4 runs):

        543.886795      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.01% )
     2,007,623,288      cycles                    #    3.691 GHz                      ( +-  0.01% )
     5,185,222,133      instructions              #    2.58  insns per cycle        
                                                  #    0.11  stalled cycles per insn  ( +-  0.00% )
     5,474,164,892      uops_issued_any           # 10064.898 M/sec                    ( +-  0.00% )
     7,586,159,182      uops_dispatched_thread    # 13948.048 M/sec                    ( +-  0.00% )
     6,626,044,316      uops_retired_all          # 12182.764 M/sec                    ( +-  0.00% )
     5,473,742,754      uops_retired_retire_slots # 10064.121 M/sec                    ( +-  0.00% )
       566,873,826      stalled-cycles-frontend   #   28.24% frontend cycles idle     ( +-  0.03% )
         3,744,569      stalled-cycles-backend    #    0.19% backend  cycles idle     ( +-  2.32% )

       0.544190971 seconds time elapsed                                          ( +-  1.01% )

In the fused domain, issued & retired_slots match for CODE2.
