Indexed branch overhead on X86 64 bit mode


Question

This is a follow-up to some comments made in this prior thread:

Recursive fibonacci Assembly

The following code snippets calculate Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unrolled loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64-bit mode with an Intel 3770K 3.5 GHz processor. With a single loop testing fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds and the computed jump about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than the computed jump version.

Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.

Partial answer: The computed jump version branches into 94 possible target locations within a 280 byte range, and apparently a branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to put the assembly based fib() function on a 16 byte boundary solved the loop version time issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.

Loop version (note I've read that | dec | jnz | is faster than | loop |):

        align   16
fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

Computed jump (indexed branch) into unrolled loop version. Each add is 3 bytes and there are 93 adds (279 bytes) ahead of the ret, so jumping to fib0+279-3*n leaves exactly n adds to execute, with the starting values chosen by parity so the result ends up in rax:

        align   16
fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

Test code that runs 1 million (1048576) loops to calculate fib(0) to fib(93), using multiples of 37 mod 93 so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.

#include <stdio.h>
#include <time.h>

typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;

extern "C" uint64_t fib(uint64_t);

/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] = 
     {0,37,74,18,55,92,36,73,17,54,
     91,35,72,16,53,90,34,71,15,52,
     89,33,70,14,51,88,32,69,13,50,
     87,31,68,12,49,86,30,67,11,48,
     85,29,66,10,47,84,28,65, 9,46,
     83,27,64, 8,45,82,26,63, 7,44,
     81,25,62, 6,43,80,24,61, 5,42,
     79,23,60, 4,41,78,22,59, 3,40,
     77,21,58, 2,39,76,20,57, 1,38,
     75,19,56,93};

/* x used to avoid compiler optimizing out result of fib() */
int main()
{
size_t i, j;
clock_t cbeg, cend;
uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}

The output for x is 0x812a62b1dc000000. The sum of fib(0) to fib(93) in hex is 0x1bb433812a62b1dc0; appending five more hex zeros for looping 0x100000 times gives 0x1bb433812a62b1dc000000. The upper 6 nibbles are truncated due to 64-bit math.

I made an all-assembly version to better control code location. The "if 1" is changed to "if 0" for the loop version. The loop version takes about 1.465 to 2.000 seconds depending on the nop padding used to put key locations on even or odd 16-byte boundaries (see comments below). The computed jump version takes about 1.04 seconds, and boundaries make less than 1% difference in timing.

        includelib msvcrtd
        includelib oldnames

        .data
; multiples of 37 mod 93 + 93 at the end
a       dq      0,37,74,18,55,92,36,73,17,54
        dq     91,35,72,16,53,90,34,71,15,52
        dq     89,33,70,14,51,88,32,69,13,50
        dq     87,31,68,12,49,86,30,67,11,48
        dq     85,29,66,10,47,84,28,65, 9,46
        dq     83,27,64, 8,45,82,26,63, 7,44
        dq     81,25,62, 6,43,80,24,61, 5,42
        dq     79,23,60, 4,41,78,22,59, 3,40
        dq     77,21,58, 2,39,76,20,57, 1,38
        dq     75,19,56,93
        .data?
        .code
;       parameters      rcx,rdx,r8,r9
;       not saved       rax,rcx,rdx,r8,r9,r10,r11
;       code starts on 16 byte boundary
main    proc
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbp
        mov     rbp,rsp
        and     rsp,0fffffffffffffff0h
        sub     rsp,64
        mov     r15,offset a
        xor     r14,r14
        mov     r11,0100000h
;       nop padding effect on loop version (with 0 padding in padx below)
;        0 puts main2 on  odd 16 byte boundary  clk = 0131876622h => 1.465 seconds
;        9 puts main1 on  odd 16 byte boundary  clk = 01573FE951h => 1.645 seconds
        rept    0
        nop
        endm
        rdtsc
        mov     r12,rdx
        shl     r12,32
        or      r12,rax
main0:  xor     r10,r10
main1:  mov     rcx,[r10+r15]
        call    fib
main2:  add     r14,rax
        add     r10,8
        cmp     r10,8*94
        jne     main1
        dec     r11
        jnz     main0
        rdtsc
        mov     r13,rdx
        shl     r13,32
        or      r13,rax
        sub     r13,r12
        mov     rdx,r14
        xor     rax,rax
        mov     rsp,rbp
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
main    endp

        align   16
padx    proc
;       nop padding effect on loop version with 0 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 0131876622h => 1.465 seconds
;       16 puts fib on even 16 byte boundary    clk = 01A13C8CB8h => 2.000 seconds
;       nop padding effect on computed jump version with 9 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 00D979792Dh => 1.042 seconds
;       16 puts fib on even 16 byte boundary    clk = 00DA93E04Dh => 1.048 seconds
        rept    0
        nop
        endm
padx    endp

        if      1       ;0 = loop version, 1 = computed jump version

fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

        else

fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

        endif
        end

Answer

This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
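
To make that "store it to memory" suggestion concrete, here is a minimal sketch (my own, not from either post) of how the all-assembly test above could end each dependency chain in a store instead of a register that just gets overwritten. The discard label is a hypothetical addition to the .data? segment:

        .data?
discard dq      ?                       ;hypothetical memory sink for the result

        .code
        ;inner loop of main with one extra store added
main1:  mov     rcx,[r10+r15]           ;rcx = a[i]
        call    fib
main2:  add     r14,rax                 ;accumulate result
        mov     discard,r14             ;store so the dep chain ends in memory
        add     r10,8
        cmp     r10,8*94
        jne     main1

In the C version the equivalent is just what the answer describes: assign the result to a volatile variable so the compiler has to emit the store.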

Old answer: why the computed jump is 1.4x faster than the loop, with the result unused

You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does after has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.

OOO execution also hides the cost of the branch-mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)

It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the sub/jcc into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)
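
To spell out that accounting with the question's own loop body (my tally, not from the answer): per Fibonacci add pair, the loop issues 3 fused-domain uops while the unrolled body issues 2, which is where the roughly 1/3 of throughput spent on loop overhead comes from:

fib0:   add     rdx,rax                 ;1 fused-domain uop
        add     rax,rdx                 ;1 fused-domain uop
        dec     rcx                     ;dec/jnz macro-fuse on SnB-family:
        jnz     fib0                    ;  1 uop, so 3 uops per add pair

        ;unrolled body: the same pair with no loop overhead
        add     rax,rdx                 ;1 fused-domain uop
        add     rdx,rax                 ;1 uop, so 2 uops per add pair

3 uops versus 2 per pair is a 1.5x issue ratio, in the same ballpark as the measured 40%.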

I think in most real use-cases, latency might be relevant. Maybe throughput will still be most important, but anything other than this will make latency matter more, because this doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but this will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns depend on the result. But if they don't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.

I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run once. (e.g. sub rcx,8 / jb .cleanup).
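
For concreteness, here is one possible shape of that suggestion, sketched against the loop version above (my own illustration, not code from the answer): the add-pair loop unrolled by 4 (8 adds per pass), with the up-front check so counts too small for a full pass fall through to a scalar tail. Label names fib0..fib3 are just for this sketch.

        align   16
fib     proc                            ;rcx == n, sketch: loop unrolled by 4 pairs
        mov     rax,rcx                 ;return n if n < 2
        cmp     rax,2
        jb      fib2
        mov     rdx,1                   ;rax = n & 1, rdx = 1 - (n & 1)
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1                   ;rcx = number of add pairs
        sub     rcx,4                   ;enough pairs for one unrolled pass?
        jb      fib1                    ;no: straight to the scalar tail
fib0:   add     rdx,rax                 ;4 add pairs per pass
        add     rax,rdx
        add     rdx,rax
        add     rax,rdx
        add     rdx,rax
        add     rax,rdx
        add     rdx,rax
        add     rax,rdx
        sub     rcx,4
        jae     fib0                    ;loop while at least 4 more pairs remain
fib1:   add     rcx,4                   ;0..3 pairs left over
        jz      fib2
fib3:   add     rdx,rax                 ;scalar tail
        add     rax,rdx
        dec     rcx
        jnz     fib3
fib2:   ret
fib     endp

The same pattern works for an unroll by 8 with sub rcx,8, as the answer suggests; the loop-overhead uops then amortize over 16 adds while small n still go through the short tail.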

Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
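
A sketch of how I read that last idea (again my own illustration, with hypothetical label names): the starting values become constants, so the add chain no longer waits on n, and odd n is handled after the loop with a cmov instead of the parity setup at the top. The answer pictures a lea/cmov pair; here the "one step further" value already sits in rdx, so a single cmov is enough.

        align   16
fib     proc                            ;rcx == n, sketch: n-independent start values
        xor     eax,eax                 ;rax = fib(0), independent of n
        mov     edx,1                   ;rdx = fib(1), independent of n
        sub     rcx,2
        jb      fib1                    ;n = 0 or 1: no pair iterations needed
fib0:   add     rax,rdx                 ;rax = fib(2k)
        add     rdx,rax                 ;rdx = fib(2k+1)
        sub     rcx,2
        jae     fib0                    ;exit when rcx goes below zero
fib1:   cmp     rcx,-1                  ;rcx = -1 here only when n was odd
        cmove   rax,rdx                 ;odd n: take the value one step further
        ret
fib     endp

The adds can start as soon as fib is entered, and the only n-dependent choice left is the single cmov at the end, off the critical path of the add chain.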
