Indexed branch overhead on X86 64 bit mode


Question

This is a follow up to some comments made in this prior thread:

Recursive fibonacci Assembly

The following code snippets calculate Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unfolded loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64 bit mode with an Intel 3770K 3.5ghz processor. With a single loop testing fib(0) thru fib(93), the best time I get for loop version is ~1.901 microseconds, and for computed jump is ~ 1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds, the computed jump takes about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than computed jump version.

Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.

Partial answer: The computed jump version branches into 94 possible target locations within a 280 byte range, and apparently a branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to put the assembly based fib() function on a 16 byte boundary solved the loop version time issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.

loop version (note I've read that dec / jnz is faster than loop):

        align   16
fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp
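
For reference, the arithmetic this loop implements can be sketched in C (a sketch with my own names; note the asm needs the `n < 2` early-out because its `dec`/`jnz` loop would wrap around when the iteration count starts at zero, while the C loop below does not):

```c
#include <stdint.h>

/* C sketch (my naming) of the loop version's math: seed (a, d) with
   (n&1, 1-(n&1)) so that n>>1 iterations of two adds leave fib(n) in a. */
uint64_t fib_loop(uint64_t n)
{
    uint64_t a = n & 1;                 /* rax: 1 if n is odd, else 0 */
    uint64_t d = 1 - a;                 /* rdx: the companion value   */
    for (uint64_t i = n >> 1; i != 0; i--) {
        d += a;                         /* add rdx,rax */
        a += d;                         /* add rax,rdx */
    }
    return a;
}
```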

computed jump (indexed branch) into unfolded loop version:

        align   16
fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp
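
The entry-point arithmetic (`fib0 + 279 - 3*n` lands so that exactly the last n of the 93 three-byte adds execute) can be mimicked in C with a fallthrough switch, Duff's-device style. A sketch with a chain of 8 adds instead of 93 (names are mine):

```c
#include <stdint.h>

/* Fallthrough switch standing in for "jmp r8": enter the unrolled add
   chain n instructions before its end. Chain shortened to 8 adds here
   (the asm uses 93), so this sketch only handles n <= 8. */
uint64_t fib_jump(uint64_t n)
{
    uint64_t a = n & 1;
    uint64_t d = 1 - a;
    switch (n) {            /* each case falls through to the next */
    case 8: d += a;         /* fall through */
    case 7: a += d;         /* fall through */
    case 6: d += a;         /* fall through */
    case 5: a += d;         /* fall through */
    case 4: d += a;         /* fall through */
    case 3: a += d;         /* fall through */
    case 2: d += a;         /* fall through */
    case 1: a += d;         /* like the chain's final add rax,rdx */
    default: break;
    }
    return a;
}
```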

Test code that runs 1 million (1048576) loops to calculate fib(0) to fib(93) using multiples of 37 mod 93, so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.

#include <stdio.h>
#include <time.h>

typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;

extern "C" uint64_t fib(uint64_t);

/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] = 
     {0,37,74,18,55,92,36,73,17,54,
     91,35,72,16,53,90,34,71,15,52,
     89,33,70,14,51,88,32,69,13,50,
     87,31,68,12,49,86,30,67,11,48,
     85,29,66,10,47,84,28,65, 9,46,
     83,27,64, 8,45,82,26,63, 7,44,
     81,25,62, 6,43,80,24,61, 5,42,
     79,23,60, 4,41,78,22,59, 3,40,
     77,21,58, 2,39,76,20,57, 1,38,
     75,19,56,93};

/* x used to avoid compiler optimizing out result of fib() */
int main()
{
size_t i, j;
clock_t cbeg, cend;
uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}

The output for x is 0x812a62b1dc000000. The sum of fib(0) to fib(93) in hex is 0x1bb433812a62b1dc0, and add 5 more zeros for looping 0x100000 times: 0x1bb433812a62b1dc000000. The upper 6 nibbles are truncated due to 64 bit math.
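
The checksum arithmetic can be verified directly (a sketch; the constants come from the paragraph above):

```c
#include <stdint.h>

/* Sum fib(0)..fib(93) with 64-bit wraparound, then multiply by the
   0x100000 repetitions (a shift by 20 bits). The high nibbles fall off
   exactly as described, leaving the printed 0x812a62b1dc000000. */
uint64_t fib_checksum(void)
{
    uint64_t a = 0, b = 1, sum = 0;
    for (int n = 0; n <= 93; n++) {
        sum += a;               /* accumulate fib(n), mod 2^64 */
        uint64_t t = a + b;     /* wraps past fib(93); not summed */
        a = b;
        b = t;
    }
    return sum << 20;           /* x 0x100000 loops, mod 2^64 */
}
```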

I made an all assembly version to better control code location. The "if 1" is changed to "if 0" for loop version. The loop version takes about 1.465 to 2.000 seconds depending on nop padding used to put key locations on even or odd 16 byte boundaries (see comments below). The computed jump version takes about 1.04 seconds and boundaries make less than 1% difference in timing.

        includelib msvcrtd
        includelib oldnames

        .data
; multiples of 37 mod 93 + 93 at the end
a       dq      0,37,74,18,55,92,36,73,17,54
        dq     91,35,72,16,53,90,34,71,15,52
        dq     89,33,70,14,51,88,32,69,13,50
        dq     87,31,68,12,49,86,30,67,11,48
        dq     85,29,66,10,47,84,28,65, 9,46
        dq     83,27,64, 8,45,82,26,63, 7,44
        dq     81,25,62, 6,43,80,24,61, 5,42
        dq     79,23,60, 4,41,78,22,59, 3,40
        dq     77,21,58, 2,39,76,20,57, 1,38
        dq     75,19,56,93
        .data?
        .code
;       parameters      rcx,rdx,r8,r9
;       not saved       rax,rcx,rdx,r8,r9,r10,r11
;       code starts on 16 byte boundary
main    proc
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbp
        mov     rbp,rsp
        and     rsp,0fffffffffffffff0h
        sub     rsp,64
        mov     r15,offset a
        xor     r14,r14
        mov     r11,0100000h
;       nop padding effect on loop version (with 0 padding in padx below)
;        0 puts main2 on  odd 16 byte boundary  clk = 0131876622h => 1.465 seconds
;        9 puts main1 on  odd 16 byte boundary  clk = 01573FE951h => 1.645 seconds
        rept    0
        nop
        endm
        rdtsc
        mov     r12,rdx
        shl     r12,32
        or      r12,rax
main0:  xor     r10,r10
main1:  mov     rcx,[r10+r15]
        call    fib
main2:  add     r14,rax
        add     r10,8
        cmp     r10,8*94
        jne     main1
        dec     r11
        jnz     main0
        rdtsc
        mov     r13,rdx
        shl     r13,32
        or      r13,rax
        sub     r13,r12
        mov     rdx,r14
        xor     rax,rax
        mov     rsp,rbp
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
main    endp

        align   16
padx    proc
;       nop padding effect on loop version with 0 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 0131876622h => 1.465 seconds
;       16 puts fib on even 16 byte boundary    clk = 01A13C8CB8h => 2.000 seconds
;       nop padding effect on computed jump version with 9 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 00D979792Dh => 1.042 seconds
;       16 puts fib on even 16 byte boundary    clk = 00DA93E04Dh => 1.048 seconds
        rept    0
        nop
        endm
padx    endp

        if      1       ;0 = loop version, 1 = computed jump version

fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

        else

fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

        endif
        end

Answer

This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
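
The "assign it to a volatile" idea looks like this in C (a sketch with my own names; a plain iterative fib stands in for the asm one so it compiles on its own):

```c
#include <stdint.h>

volatile uint64_t discard;          /* store can't be optimized away */

/* stand-in for the asm fib(), so this sketch is self-contained */
static uint64_t fib(uint64_t n)
{
    uint64_t a = 0, b = 1;
    while (n--) { uint64_t t = a + b; a = b; b = t; }
    return a;
}

/* The dep chain now ends in a real store to memory instead of a
   register that simply gets clobbered on the next iteration. */
uint64_t sum_fibs(void)
{
    uint64_t x = 0;
    for (int i = 0; i <= 93; i++)
        x += fib(i);
    discard = x;
    return x;
}
```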

Old answer: why the computed jump is 1.4x faster than the loop when the result is unused

You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does after has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.

OOO execution also hides the cost of the branch-mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)

It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the sub/jcc into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)
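
The ~40% figure can be sanity-checked with a per-iteration uop count (my arithmetic, using the macro-fusion noted above):

```text
loop body:     add rdx,rax + add rax,rdx + fused dec rcx/jnz = 3 uops per 2 adds
unrolled body: add rax,rdx + add rdx,rax                     = 2 uops per 2 adds

uop ratio 3/2 = 1.5x, the same order as the measured ~1.4x slowdown
```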

I think in most real use-cases, latency might be more relevant. Maybe throughput will still be most important, but anything other than this usage will make latency matter more, because this doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but this will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.

I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run once. (e.g. sub rcx,8 / jb .cleanup).
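
That middle ground might look like this in C (a sketch under my own assumptions; the asm version would do the entry check with `sub rcx,4` / `jb .cleanup` as suggested above):

```c
#include <stdint.h>

/* Unroll-by-4 sketch: four add pairs per iteration, plus a scalar
   cleanup loop for the 0..3 leftover pairs. */
uint64_t fib_unroll4(uint64_t n)
{
    uint64_t a = n & 1;
    uint64_t d = 1 - a;
    uint64_t pairs = n >> 1;        /* each pair is two adds */
    while (pairs >= 4) {            /* cf. sub rcx,4 / jb .cleanup */
        d += a; a += d;
        d += a; a += d;
        d += a; a += d;
        d += a; a += d;
        pairs -= 4;
    }
    while (pairs != 0) {            /* cleanup */
        d += a; a += d;
        pairs--;
    }
    return a;
}
```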

Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
