Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?

Question

TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are from gcc -O0 (un-optimized) loops using ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.

Let's say we have those two loops:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)
    movl    %edx, -92(%ebp)
    jmp L21
L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
L21:
    cmpl    $999999, -48(%ebp)
    jle     L22

And the second one:

    movl    %eax, -104(%ebp)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
L23:
    cmpl    $999999, -48(%ebp)
    jle L24

Those loops do exactly the same thing, but in a slightly different manner; please refer to the comments for the details.

This asm code is generated from the following two C++ loops:

    //FIRST LOOP:
    for(;index<size;index++){
        *(ptr++) = index;
    }
    //SECOND LOOP:
    ptr = data - 1;
    for(index = 0;index<size;index++){
        *(++ptr) = index;
    }

Now, the first loop is about ~18% faster than the second one: no matter in which order the loops are executed, the one with ptr++ is faster than the one with ++ptr.

To run my benchmarks I just collected the running times of those loops for different sizes, executing them both nested inside other loops to repeat the operation frequently.

Looking at the asm code, the second loop contains fewer instructions: we have 3 movl and 2 addl, whereas in the first loop we have 4 movl, one addl, and one leal. So we have one more movl, and a leal instead of an addl.

Is it correct that the LEA operation for computing the correct address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?

As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4,-12(%ebp), needs to wait a little before proceeding, whereas in the first loop we can refer to memory immediately while leal computes the next address (some kind of better pipelining here).

Is there some reordering going on here? I'm not sure about my explanation for the performance difference of these loops; can I have your opinion?

Answer

First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.

Is it correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?

Nope, add can run on every ALU execution port on any x86 CPU. lea usually has fairly low latency with simple addressing modes, but not as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.

See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.

add is only worse because it lets gcc -O0 make even worse code by using it with a memory destination and then loading from that.

Compiling with -O0 doesn't even try to use the best instructions for the job, e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.

-O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on that latency.

As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop, after the addl $4,-12(%ebp), needs to wait a little before proceeding,

Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.

whereas in the first loop we can immediately refer the memory

and in the meanwhile LEAL will compute the next address

No.

Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and next iteration's lea can't actually start any sooner than in the loop using add.

uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)

When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.

Assuming an Intel SnB-family CPU:

## ptr++: 1st loop
movl    -12(%ebp), %eax   //1 uop (load)
leal    4(%eax), %edx     //1 uop (ALU only)
movl    %edx, -12(%ebp)   //1 store-address, 1 store-data
//   no load from -12(%ebp) into %eax
... rest the same.


 ## ++ptr:  2nd loop
addl    $4, -12(%ebp)       // read-modify-write: 2 fused-domain uops.  4 unfused: 1 load + 1 store-address + 1 store-data
movl    -12(%ebp), %eax     // load: 1 uop.   ~6 cycle latency for %eax to be ready
... rest the same

So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).

However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp) loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).

Adding an extra load uop to the lea version, off the critical path, would take more throughput, but wouldn't be part of the loop's latency chains. e.g.

movl    -12(%ebp), %eax   //Get the current address
leal    4(%eax), %edx     //Calculate the next address
movl    %edx, -12(%ebp)   //Store the new (next) address

mov     -12(%ebp), %edx 

%edx is about to be overwritten by a mov, so there are no dependencies on the result of this load. (The destination of mov is write-only, so it breaks dependency chains, thanks to register renaming.)

So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.

Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.

uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.

i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.

Similarly, in the add loop where the extra load is part of the critical path, the extra load causes more resource conflicts, delaying operations on the critical path.

Other guesses:

So maybe having the store address ready sooner is what's doing it, so memory operations are pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.)

Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.
