Indexed branch overhead on X86 64 bit mode
Question
This is a follow-up to some comments made in this prior thread:
The following code snippets calculate Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unfolded loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64 bit mode with an Intel 3770K 3.5ghz processor. With a single loop testing fib(0) thru fib(93), the best time I get for loop version is ~1.901 microseconds, and for computed jump is ~ 1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds, the computed jump takes about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than computed jump version.
Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.
Partial answer: The computed jump version branches into 94 possible target locations within a 280 byte range, and apparently a branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to put the assembly based fib() function on a 16 byte boundary solved the loop version time issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.
loop version (note: I've read that dec / jnz is faster than loop):
align 16
fib proc ;rcx == n
mov rax,rcx ;br if < 2
cmp rax,2
jb fib1
mov rdx,1 ;set rax, rdx
and rax,rdx
sub rdx,rax
shr rcx,1
fib0: add rdx,rax
add rax,rdx
dec rcx
jnz fib0
fib1: ret
fib endp
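For reference, the parity trick in that asm can be restated in C (a sketch mirroring the registers above; fib_c is my name for it, and it is not part of the benchmark):

```c
#include <stdint.h>

/* C mirror of the asm loop version: seed the (rax, rdx) pair from n's
   parity, then each of the n>>1 iterations does two adds, advancing the
   pair by two Fibonacci steps. */
uint64_t fib_c(uint64_t n)
{
    if (n < 2)
        return n;                  /* fib(0) = 0, fib(1) = 1 */
    uint64_t rax = n & 1;          /* 1 if n odd, 0 if even */
    uint64_t rdx = 1 - rax;
    for (uint64_t rcx = n >> 1; rcx != 0; rcx--) {
        rdx += rax;
        rax += rdx;
    }
    return rax;
}
```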
computed jump (indexed branch) into unfolded loop version:
align 16
fib proc ;rcx == n
mov r8,rcx ;set jmp adr
mov r9,offset fib0+279
lea r8,[r8+r8*2]
neg r8
add r8,r9
mov rax,rcx ;set rax,rdx
mov rdx,1
and rax,rdx
sub rdx,rax
jmp r8
fib0: ; assumes add xxx,xxx takes 3 bytes
rept 46
add rax,rdx
add rdx,rax
endm
add rax,rdx
ret
fib endp
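What the computed jump does, restated in C (a hedged sketch, not the benchmark code): each add is 3 bytes and the chain is 93 adds (279 bytes) long, so jumping to fib0+279-3*n executes exactly the last n adds of the alternating chain before the ret:

```c
#include <stdint.h>

/* Emulates entering the 93-add alternating chain at fib0 + 279 - 3*n:
   chain positions 94-n .. 93 execute; odd positions are add rax,rdx,
   even positions are add rdx,rax. Valid for n <= 93, as in the test. */
uint64_t fib_jump_c(uint64_t n)
{
    uint64_t rax = n & 1;
    uint64_t rdx = 1 - rax;
    for (uint64_t p = 94 - n; p <= 93; p++) {
        if (p & 1)
            rax += rdx;
        else
            rdx += rax;
    }
    return rax;                    /* n = 0 skips the loop, returns 0 */
}
```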
Test code that runs 1 million (1048576) loops to calculate fib(0) to fib(93) using multiples of 37%93 so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.
#include <stdio.h>
#include <time.h>
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
extern "C" uint64_t fib(uint64_t);
/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] =
{0,37,74,18,55,92,36,73,17,54,
91,35,72,16,53,90,34,71,15,52,
89,33,70,14,51,88,32,69,13,50,
87,31,68,12,49,86,30,67,11,48,
85,29,66,10,47,84,28,65, 9,46,
83,27,64, 8,45,82,26,63, 7,44,
81,25,62, 6,43,80,24,61, 5,42,
79,23,60, 4,41,78,22,59, 3,40,
77,21,58, 2,39,76,20,57, 1,38,
75,19,56,93};
/* x used to avoid compiler optimizing out result of fib() */
int main()
{
    size_t i, j;
    clock_t cbeg, cend;
    uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}
The output for x is 0x812a62b1dc000000. The sum of fib(0) to fib(93) in hex is 0x1bb433812a62b1dc0, and add 5 more zeros for looping 0x100000 times: 0x1bb433812a62b1dc000000. The upper 6 nibbles are truncated due to 64 bit math.
I made an all assembly version to better control code location. The "if 1" is changed to "if 0" for loop version. The loop version takes about 1.465 to 2.000 seconds depending on nop padding used to put key locations on even or odd 16 byte boundaries (see comments below). The computed jump version takes about 1.04 seconds and boundaries make less than 1% difference in timing.
includelib msvcrtd
includelib oldnames
.data
; multiples of 37 mod 93 + 93 at the end
a dq 0,37,74,18,55,92,36,73,17,54
dq 91,35,72,16,53,90,34,71,15,52
dq 89,33,70,14,51,88,32,69,13,50
dq 87,31,68,12,49,86,30,67,11,48
dq 85,29,66,10,47,84,28,65, 9,46
dq 83,27,64, 8,45,82,26,63, 7,44
dq 81,25,62, 6,43,80,24,61, 5,42
dq 79,23,60, 4,41,78,22,59, 3,40
dq 77,21,58, 2,39,76,20,57, 1,38
dq 75,19,56,93
.data?
.code
; parameters rcx,rdx,r8,r9
; not saved rax,rcx,rdx,r8,r9,r10,r11
; code starts on 16 byte boundary
main proc
push r15
push r14
push r13
push r12
push rbp
mov rbp,rsp
and rsp,0fffffffffffffff0h
sub rsp,64
mov r15,offset a
xor r14,r14
mov r11,0100000h
; nop padding effect on loop version (with 0 padding in padx below)
; 0 puts main2 on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
; 9 puts main1 on odd 16 byte boundary clk = 01573FE951h => 1.645 seconds
rept 0
nop
endm
rdtsc
mov r12,rdx
shl r12,32
or r12,rax
main0: xor r10,r10
main1: mov rcx,[r10+r15]
call fib
main2: add r14,rax
add r10,8
cmp r10,8*94
jne main1
dec r11
jnz main0
rdtsc
mov r13,rdx
shl r13,32
or r13,rax
sub r13,r12
mov rdx,r14
xor rax,rax
mov rsp,rbp
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
main endp
align 16
padx proc
; nop padding effect on loop version with 0 padding above
; 0 puts fib on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
; 16 puts fib on even 16 byte boundary clk = 01A13C8CB8h => 2.000 seconds
; nop padding effect on computed jump version with 9 padding above
; 0 puts fib on odd 16 byte boundary clk = 00D979792Dh => 1.042 seconds
; 16 puts fib on even 16 byte boundary clk = 00DA93E04Dh => 1.048 seconds
rept 0
nop
endm
padx endp
if 1 ;0 = loop version, 1 = computed jump version
fib proc ;rcx == n
mov r8,rcx ;set jmp adr
mov r9,offset fib0+279
lea r8,[r8+r8*2]
neg r8
add r8,r9
mov rax,rcx ;set rax,rdx
mov rdx,1
and rax,rdx
sub rdx,rax
jmp r8
fib0: ; assumes add xxx,xxx takes 3 bytes
rept 46
add rax,rdx
add rdx,rax
endm
add rax,rdx
ret
fib endp
else
fib proc ;rcx == n
mov rax,rcx ;br if < 2
cmp rax,2
jb fib1
mov rdx,1 ;set rax, rdx
and rax,rdx
sub rdx,rax
shr rcx,1
fib0: add rdx,rax
add rax,rdx
dec rcx
jnz fib0
fib1: ret
fib endp
endif
end
Answer
This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
Old answer: why the computed jump is 1.4x faster than the loop, with the result unused
You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does after has a data dependency on the result.
Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: 168-entry ROB (from issue to retirement), and 54-entry scheduler (from issue to execute), and a 160-entry physical register file. See also PRF vs. ROB limits for OOO window size.
OOO execution also hides the cost of the branch-mispredict before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)
It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the dec/jnz into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)
I think in most real use-cases, latency might still be relevant. Maybe throughput will still be most important, but anything other than this will make latency more important, because this doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but this will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.
I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure it should run once. (e.g. sub rcx,8 / jb .cleanup).
Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)