Why is this jump instruction so expensive when performing pointer chasing?

Problem description

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible. I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).

Some things to note:

  • the pointer chasing path can comfortably fit in the L1 data cache
  • using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect

perf record has the following output:

Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926                                                                                                                                                               
myFunction  /tmp/foobar [Percent: local hits]                                                                                                                                                                            
Percent│      endbr64                                                                                                                                                                                                                       
      ...
 80.09 │20:   mov     (%rdx,%rbx,1),%ebx                                                                                                                                                                                                    
  0.07 │      add     $0x1,%rax                                                                                                                                                                                                             
       │      cmp     $0xffffffff,%ebx                                                                                                                                                                                                      
 19.84 │    ↑ jne     20                                                                                                                                                                                                                    
      ...
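
For reference, a loop of roughly the following shape compiles down to the inner loop shown above. This is only a sketch reconstructed from the disassembly; the real myFunction isn't shown in the question, so the signature, the buffer layout and the variable names are all assumptions:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical reconstruction: buf holds 32-bit byte offsets, each entry
 * giving the position of the next entry, with 0xFFFFFFFF as the
 * end-of-chain sentinel. Offsets are assumed to be 4-byte aligned. */
static size_t myFunction(const char *buf, uint32_t start)
{
    size_t count = 0;      /* the counter held in %rax in the disassembly */
    uint32_t idx = start;  /* the current offset, held in %ebx            */
    do {
        /* mov (%rdx,%rbx,1),%ebx : load the next offset from buf + idx   */
        idx = *(const uint32_t *)(buf + idx);
        /* add $0x1,%rax : count this step of the chain                   */
        count++;
        /* cmp $0xffffffff,%ebx / jne 20 : macro-fused compare-and-branch
         * on the value just loaded; this exit condition is where the
         * __builtin_expect hint from the notes above would go.           */
    } while (idx != 0xFFFFFFFFu);
    return count;
}

With the chain resident in L1d, each iteration is just these three fused-domain uops, and the critical loop-carried dependency is the load feeding its own address for the next iteration.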

I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf. I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.

I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.

Solution

Remember that the cycles event has to pick an instruction to blame, even if both mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (Modern Microprocessors A 90-Minute Guide! and https://agner.org/optimize/)

But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.

The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.

The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
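
As a rough check on that reading (my arithmetic, working from the numbers in the perf output above): 80.09 / 19.84 ≈ 4.04, i.e. almost exactly 4:1, which is what you'd see if, in a steady state of about 5 cycles per iteration (the assumed L1d load-use latency with an indexed addressing mode), the load were blamed for 4 of the cycles and the macro-fused cmp/jne for the remaining 1.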

IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.
