Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?


Question


TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are from gcc -O0 (un-optimized) loops using ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.


Let's say we have those two loops:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)
    movl    %edx, -92(%ebp)
    jmp L21
L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
L21:
    cmpl    $999999, -48(%ebp)
    jle     L22

and the second one:

    movl    %eax, -104(%ebp)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
L23:
    cmpl    $999999, -48(%ebp)
    jle L24

Those loops do exactly the same thing, just in a slightly different manner; see the comments for the details.

This asm code is generated from the following two C++ loops:

    //FIRST LOOP:
    for(;index<size;index++){
        *(ptr++) = index;
    }
    //SECOND LOOP:
    ptr = data - 1;
    for(index = 0;index<size;index++){
        *(++ptr) = index;
    }

Now, the first loop is about ~18% faster than the second one: no matter in which order the loops are performed, the one with ptr++ is faster than the one with ++ptr.

To run my benchmarks I just collected the running times of those loops for different sizes, executing them both nested in other loops to repeat the operation frequently.


ASM analysis

Looking at the ASM code, the second loop contains fewer instructions: it has 3 movl and 2 addl, whereas the first loop has 4 movl, one addl and one leal. So the first loop has one more movl, and a leal instead of an addl.

Is that correct that the LEA operation for computing the correct address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?

As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced. So the second loop, after the addl $4,-12(%ebp), needs to wait a little before proceeding, whereas in the first loop we can immediately refer to memory while leal computes the next address (some kind of better pipelining performance here).

Is there some reordering going on here? I'm not sure about my explanation of the performance difference between those loops; can I have your opinion?

Solution

First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.


Is that correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?

Nope, add can run on every ALU execution port on any x86 CPU. lea is usually just as low-latency with simple addressing modes, but doesn't have as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.

See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.

add is only worse because it lets gcc -O0 make even worse code by using it with a memory destination and then loading from that.


Compiling with -O0 doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.

-O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on that.


As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4,-12(%ebp), needs to wait a little before proceeding,

Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.
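This latency is easy to reproduce from C++ (my own microbenchmark, not from the answer): a loop-carried increment forced through memory with `volatile` — which is essentially how gcc -O0 treats every local — pays the store-forward-then-reload latency on every iteration, while the same chain kept in a register does not.

```cpp
// A loop-carried dependency through memory vs. through a register.
// `volatile` makes the compiler store and reload `counter` every iteration,
// mimicking what gcc -O0 does with locals. Compiled at -O2 on typical x86
// hardware, the memory version runs several times slower per iteration,
// because each increment must wait for store-forwarding of the previous store.
long long chain_through_memory(int iters) {
    volatile long long counter = 0;   // lives in memory: store + reload per ++
    for (int i = 0; i < iters; ++i) counter = counter + 1;
    return counter;
}

long long chain_through_register(int iters) {
    long long counter = 0;            // stays in a register (or folds away) at -O2
    for (int i = 0; i < iters; ++i) counter = counter + 1;
    return counter;
}
```

Both compute the same result; only the location of the loop-carried value differs, which is exactly the kind of bottleneck -O0 code is full of.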

whereas in the first loop we can immediately refer to memory

Yes

and in the meanwhile LEAL will compute the next address

No.

Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and next iteration's lea can't actually start any sooner than in the loop using add.


The latency issues might not be the loop throughput bottleneck:

uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)

When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.

Assuming an Intel SnB-family CPU:

## ptr++: 1st loop
movl    -12(%ebp), %eax   //1 uop (load)
leal    4(%eax), %edx     //1 uop (ALU only)
movl    %edx, -12(%ebp)   //1 store-address, 1 store-data
//   no load from -12(%ebp) into %eax
... rest the same.


 ## ++ptr:  2nd loop
addl    $4, -12(%ebp)       // read-modify-write: 2 fused-domain uops.  4 unfused: 1 load + 1 store-address + 1 store-data
movl    -12(%ebp), %eax     // load: 1 uop.   ~6 cycle latency for %eax to be ready
... rest the same

So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).

However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp) loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).

We could test this theory:

Adding an extra load uop to the lea version, off the critical path, would take more throughput, but wouldn't be part of the loop's latency chains. e.g.

movl    -12(%ebp), %eax   //Get the current address
leal    4(%eax), %edx     //Calculate the next address
movl    %edx, -12(%ebp)   //Store the new (next) address

mov     -12(%ebp), %edx 

%edx is about to be overwritten by a mov, so there are no dependencies on the result of this load. (The destination of mov is write-only, so it breaks dependency chains, thanks to register renaming.)

So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.


Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.

Why extra uops matter when we're not hitting execution port throughput bottlenecks

uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.

i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.

Similarly, in the add loop where the extra load is part of the critical path, the extra load causes more resource conflicts, delaying operations on the critical path.
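A toy simulation (entirely my own model, with made-up latencies — not a description of real hardware) of oldest-first scheduling on a single load port shows the mechanism: when an older "unused" load becomes ready in the same cycle as a critical-path load, it wins the port and pushes the whole chain back by a cycle every iteration.

```cpp
#include <vector>

// Toy model of oldest-first uop scheduling on one load port (illustrative
// assumptions only): one load can start per cycle, and its result is ready
// LAT cycles later.  Each iteration has one critical load (part of the
// loop-carried chain) and, optionally, one OLDER unused load that becomes
// ready at the same time (both wait on the previous iteration's store).
long finish_time(int iters, bool extra_load, int LAT = 6) {
    struct Load { int id; long ready; bool crit; };
    std::vector<Load> pending;
    int id = 0;
    long done = 0;                        // completion time of last critical load
    auto enqueue = [&](long ready) {
        if (extra_load) pending.push_back({id++, ready, false});  // older id wins
        pending.push_back({id++, ready, true});
    };
    enqueue(0);
    int remaining = iters;
    for (long cycle = 0; remaining > 0; ++cycle) {
        int best = -1;                    // oldest pending load whose input is ready
        for (int j = 0; j < (int)pending.size(); ++j)
            if (pending[j].ready <= cycle &&
                (best < 0 || pending[j].id < pending[best].id))
                best = j;
        if (best < 0) continue;           // port idle this cycle
        Load l = pending[best];
        pending.erase(pending.begin() + best);
        if (l.crit) {
            done = cycle + LAT;
            if (--remaining > 0) enqueue(done);  // next iteration waits on this result
        }
    }
    return done;
}
```

In this model, 100 iterations finish at cycle 600 without the unused load and at cycle 700 with it: the extra load costs one cycle of critical-path latency per iteration even though the port is idle most of the time — a resource conflict, exactly as described above.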


Other guesses:

So maybe having the store address ready sooner is what's doing it, so memory operations are pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.)

Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.

