X86 prefetching optimizations: "computed goto" threaded code


Problem Description


I have a rather non-trivial problem, where my computational graph has cycles and multiple "computational paths". Instead of making a dispatcher loop, where each vertex will be called one-by-one, I had an idea to place all pre-allocated "frame objects" in the heap (code+data).
This is somewhat analogous to threaded code (or even better: CPS), just jumping around the heap, executing code. Each code piece is associated with its own "frame pointer" in the heap and uses data relative to that. Frames remain always allocated. The code just produces side-effects at known locations, computes (if necessary) next goto value and jumps there.
I haven't tried it out yet (this will be a major undertaking to make it correct and I am fully aware of all the difficulties) so I wanted to ask experts on x86 machinery: can it be faster than a dispatcher loop? I know there are several optimizations for call/ret instructions taking place in hardware.
Is there a difference between accessing data relative to the stack pointer or any other pointer? Is there a prefetching for an indirect jump (jump to value stored in register?).
Is this idea even viable?

P.S. if you've read this and still couldn't understand what I mean by this idea (pardon my failed attempts to explain things) imagine this whole as a set of many pre-allocated coroutines on a heap which yield to each other. The standard x86 stack is not used in the process, as everything is on the heap.

Solution

Jumping directly from block to block is often a win for branch prediction, vs. returning up to one parent indirect-branch, especially on CPUs older than Intel Haswell.


With jumps from the tail of each block, each branch has a different branch-predictor history. It's probably common for a given block to usually jump to the same next block, or have a simple pattern of a couple of target addresses. This can often be predicted well because each branch individually has a simpler pattern, and the branch history is distributed over multiple branches.

If all dispatching happens from a single indirect branch, there may only be one BTB (branch target buffer) entry for it, and the pattern will be too complicated for it to predict well.
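
To make the contrast concrete, here is a minimal sketch in GNU C (using the labels-as-values / "computed goto" extension); the bytecode, opcode names, and handlers are made up for illustration. The first loop funnels every transition through one indirect branch; the second gives each handler tail its own indirect jump, and therefore its own BTB entry and history.

    /* Sketch (GNU C): central dispatch vs. threaded dispatch with a replicated
       indirect branch at the tail of every handler.  Opcodes are hypothetical. */
    enum { OP_INC, OP_DEC, OP_HALT };

    long run_central(const unsigned char *code) {
        long acc = 0;
        for (unsigned ip = 0;;) {
            switch (code[ip++]) {         /* one indirect branch carries every transition */
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            default:     return acc;      /* OP_HALT */
            }
        }
    }

    long run_threaded(const unsigned char *code) {
        static void *handler[] = { &&do_inc, &&do_dec, &&do_halt };
        long acc = 0;
        unsigned ip = 0;
        goto *handler[code[ip++]];
    do_inc:  acc++; goto *handler[code[ip++]];   /* separate jmp: its own BTB entry/history */
    do_dec:  acc--; goto *handler[code[ip++]];
    do_halt: return acc;
    }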

Modern TAGE branch predictors in Intel Haswell and later index the BTB using recent branch history, including the indirect-branch destination, which does actually work around this problem. See the comments on Indexed branch overhead on X86 64 bit mode, and search for Haswell in https://danluu.com/branch-prediction/

Specifically, Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) by Rohou, Swamy, and Seznec compares Nehalem, SandyBridge, and Haswell on interpreter benchmarks and measures actual mispredict rate for dispatch loops with a single switch statement. They find that Haswell does much better, likely using an ITTAGE predictor.

They don't test AMD CPUs. AMD has published some info on their CPUs since Piledriver using Perceptron neural networks for branch prediction. I don't know how well they handle dispatch loops with a single indirect branch.


Darek Mihocka discusses this pattern in the context of an interpreting CPU emulator, which jumps from block to block of handlers for different instructions (or simplified uops). He goes into a lot of detail about the performance of various strategies on Core2, Pentium4, and AMD Phenom. (It was written in 2008.) Modern branch predictors on current CPUs are most similar to the Core2.

He eventually presents what he calls the Nostradamus Distributor pattern for checking for early-out (functions return a function-pointer, or a "fire escape" sentinel), in a branch-prediction-friendly way. If you don't need that, just see the early part of the article where he talks about direct chaining of jumps between blocks vs. a central distributor.

He even bemoans the lack of a code-prefetch instruction in x86. That was probably a bigger deal on Pentium 4, where initial decode to populate the trace cache was very slow compared to running from the trace cache. Sandybridge-family has a decoded-uop cache, but it isn't a trace cache, and the decoders are still strong enough to not suck when the uop cache misses. Ryzen is similar.

Is there a difference between accessing data relative to the stack pointer or any other pointer?

No. You could even set rsp after jumping so each block can have its own stack. If you have any signal handlers installed, rsp needs to point to valid memory. Also, if you want to be able to call any normal library functions, you need rsp to work as a stack pointer, because they will want to ret.
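
For a runnable C-level analogue of "each block owns its own stack" (not the JIT mechanism itself, just the idea), POSIX ucontext coroutines run on caller-supplied, heap-allocated stacks. This is only a sketch; the ucontext functions are obsolescent in POSIX but still shipped by glibc.

    /* Sketch: two contexts yielding to each other; the worker runs on a
       heap-allocated stack, like a pre-allocated frame with its own rsp. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, work_ctx;

    static void worker(void) {
        puts("worker: running on its own heap-allocated stack");
        swapcontext(&work_ctx, &main_ctx);             /* yield back to main */
        puts("worker: resumed");
    }                                                  /* falls through to uc_link */

    int main(void) {
        getcontext(&work_ctx);
        work_ctx.uc_stack.ss_sp   = malloc(64 * 1024); /* this frame's private stack */
        work_ctx.uc_stack.ss_size = 64 * 1024;
        work_ctx.uc_link          = &main_ctx;         /* where to go when worker returns */
        makecontext(&work_ctx, worker, 0);

        swapcontext(&main_ctx, &work_ctx);             /* enter the worker */
        swapcontext(&main_ctx, &work_ctx);             /* resume it after its yield */
        puts("main: done");
        return 0;
    }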

Is there a prefetching for an indirect jump (jump to value stored in register?).

Prefetch into L2 could be useful if you know the branch target address long before you're ready to execute an indirect jump. All current x86 CPUs use split L1I / L1D caches, so prefetcht0 would pollute L1D for no gain, but prefetcht1 might be useful (fetch into L2 and L3). Or it might not be useful at all, if the code is already hot in L2.
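
A hedged sketch of such a software prefetch (the next_block pointer and call site are hypothetical); _mm_prefetch with _MM_HINT_T1 emits prefetcht1:

    /* Sketch: if the next block's address is known long before the jump,
       pull its code bytes toward L2/L3 with prefetcht1, then keep doing
       independent work until the eventual indirect jump. */
    #include <immintrin.h>

    static inline void prefetch_next_block(const void *next_block) {
        _mm_prefetch((const char *)next_block, _MM_HINT_T1);  /* L2/L3, not L1D */
    }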

Also useful: calculate the jump target address as early as possible, so out-of-order execution can resolve the branch while lots of work is queued up in the out-of-order core. This minimizes the potential bubble in the pipeline. Keep the calculation independent of other stuff if possible.

Best case is having the address in a register many instructions before the jmp, so as soon as the jmp gets a cycle on an execution port, it can provide the correct destination to the front-end (and re-steer if branch prediction got it wrong). Worst case is when the branch target is the result of a long dependency chain of instructions right before the branch. A couple of independent instructions, and/or a memory-indirect jump, is fine; out-of-order execution should find cycles to run those instructions once they're in the OOO scheduler.
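
In source terms the same idea is hoisting the load of the dispatch target well ahead of the indirect jump, as in this GNU-C sketch (the frame layout and handlers are made up):

    /* Sketch: resolve the dispatch target early, then do independent work, so
       the indirect branch's input is ready long before the jump executes. */
    struct frame { unsigned next; long data[8]; };

    long process(struct frame *f) {
        static void *handler[] = { &&blk_a, &&blk_b };
        void *target = handler[f->next];      /* branch target loaded up front */
        long acc = 0;
        for (int i = 0; i < 8; i++)           /* independent work overlaps the load */
            acc += f->data[i];
        goto *target;                         /* indirect jump, input long since ready */
    blk_a: return acc + 1;
    blk_b: return acc - 1;
    }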

There are also split L1iTLB and L1dTLBs, but the L2TLB is usually unified on most microarchitectures. But IIRC, the L2TLB works as a victim cache for the L1 TLBs. A prefetch might trigger a page walk to populate an entry in the L1 data TLB, but on some microarchitectures that wouldn't help avoid an iTLB miss. (At least it would get the page table data itself into L1D or maybe internal page-directory caches in the page-walk hardware, so another page walk for the same entry would be fast. But since CPUs other than Intel Skylake (and later) only have 1 hardware page-walk unit, if the iTLB miss happens while the first page walk is still happening, it might not be able to start right away, so could actually hurt if your code is so scattered that you're getting iTLB misses.)

Use 2MB hugepages for the chunk of memory you will JIT into to reduce TLB misses. Probably best to lay out the code in a fairly tight region, with the data separate. DRAM locality effects are a real thing. (A DRAM page is usually bigger than 4kiB, I think, but it's a hardware thing and you can't choose. It's lower latency to access within an already-open page.)
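
A hedged, Linux-specific sketch of reserving such a region: MAP_HUGETLB requires hugepages reserved in advance (vm.nr_hugepages), so the fallback asks for transparent hugepages with madvise instead. A real JIT would also avoid leaving the region writable and executable at the same time.

    /* Sketch: back the JIT code region with 2MiB pages to cut (i)TLB misses. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    void *alloc_jit_region(size_t size) {        /* size: a multiple of 2MiB */
        int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
        void *p = mmap(NULL, size, prot,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {                   /* no pre-reserved hugepages */
            p = mmap(NULL, size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED)
                madvise(p, size, MADV_HUGEPAGE); /* hint: transparent hugepages */
        }
        return p;                                /* MAP_FAILED on error */
    }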

See Agner Fog's microarch pdf, and also Intel's optimization manual. (And AMD's manual too, if you are worried about AMD CPUs.) See more links in the x86 tag wiki.

Is this idea even viable?

Yes, probably.

If possible, when one block always jumps to another block, elide the jump by making the blocks contiguous.

Relative addressing for data is easy: x86-64 has RIP-relative addressing.

You can lea rdi, [rel some_label] and then index from there, or just use RIP-relative addressing directly for some of your static data.

You're going to be JITting your code or something, so just calculate signed offsets from the end of the current instruction to the data to be accessed, and that's your RIP-relative offset. Position-independent code + static data is easy in x86-64.
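
For instance, emitting lea rdi, [rel some_label] by hand comes down to three fixed bytes (48 8D 3D) followed by a signed 32-bit displacement measured from the end of the 7-byte instruction. A hedged sketch, assuming the data lands within +/-2GiB of the code:

    /* Sketch: hand-emit 'lea rdi, [rip + rel32]' into a JIT buffer.
       rel32 is relative to the END of this 7-byte instruction. */
    #include <stdint.h>
    #include <string.h>

    size_t emit_lea_rdi_rip(uint8_t *code, const void *data) {
        const size_t len = 7;                    /* 48 8D 3D disp32 */
        int32_t rel = (int32_t)((intptr_t)data - ((intptr_t)code + (intptr_t)len));
        code[0] = 0x48;                          /* REX.W */
        code[1] = 0x8D;                          /* LEA r64, m */
        code[2] = 0x3D;                          /* ModRM: rdi, [rip + disp32] */
        memcpy(code + 3, &rel, sizeof rel);      /* little-endian disp32 */
        return len;
    }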

