Conditional jump instructions in MSROM procedures?

Question

This is related to another question.

Thinking about it, though: on a modern Intel CPU the SEC phase is implemented in microcode, meaning there would be a check whereby a burned-in key is used to verify the signature on the PEI ACM. If it doesn't match, it needs to do something; if it does match, it needs to do something else. Given that this is implemented as an MSROM procedure, there must be a way of branching, even though MSROM instructions do not have RIPs.

Usually, when a branch is mispredicted as taken, then when the instruction retires the ROB checks the exception code and either adds the instruction length to the RIP of the ROB line or just uses the next ROB entry's IP. The front end is then resteered to that address, along with branch prediction updates. With the BOB, this functionality has been handed over to the jump execution units. Obviously this can't happen with an MSROM routine, as the front end has nothing to do with it.

My thought is that there is a specific jump instruction, which only an MSROM routine can issue, that jumps to a different location in the MSROM. MSROM branch instructions could be configured so that they are always predicted not taken; when the branch execution unit encounters one and the branch is taken, it produces an exception code, perhaps concatenates the special jump destination to it, and an exception occurs at retirement. Alternatively, the execution unit could take care of it using the BOB, but I'm under the impression that the BOB is indexed by branch instruction RIP. There's also the fact that exceptions that generate MSROM code are usually handled at retirement; I don't think a branch misprediction requires the MSROM, and rather all actions are performed internally.

Recommended answer

Microcode branches are apparently special. Intel's P6 and SnB families do not support dynamic prediction for microcode branches, according to Andy Glew's description of the original P6 (What setup does REP do?, https://stackoverflow.com/questions/33902068/what-setup-does-rep-do/33905887#33905887). Given the similar performance of SnB-family rep-string instructions, I assume this PPro fact applies even to the most recent Skylake / CoffeeLake CPUs (see Footnote 1).

But there is a penalty for microcode branch misprediction, so they are statically(?) predicted. (This is why rep movsb startup cost goes in increments of 5 cycles for low/medium/high counts in ECX, and for aligned vs. misaligned.)

A microcoded instruction takes a full line to itself in the uop cache. When it reaches the front of the IDQ, it takes over the issue/rename stage until it's done issuing microcode uops. (See also "How are microcodes executed during an instruction cycle?" for more detail, and for some evidence from perf event descriptions like idq.dsb_uops that show the IDQ can be accepting new uops from the uop cache while the issue/rename stage is reading from the microcode sequencer.)

For rep-string instructions, I think each iteration of the loop has to actually issue through the front-end, not just loop inside the back-end and reuse those uops. So this involves feedback from the OoO back-end to find out when the instruction is finished executing.

I don't know the details of what happens when issue/rename switches over to reading uops from the MS-ROM instead of the IDQ.

Even though each uop doesn't have its own RIP (being part of a single microcoded instruction), I'd guess that the branch mispredict detection mechanism works similarly to normal branches.

rep movs setup times on some CPUs seem to go in steps of 5 cycles depending on which case it is (small vs. large, alignment, etc.). If these steps come from microcode branch mispredicts, that would appear to mean the mispredict penalty is a fixed number of cycles, unless that's just a special case of rep movs. Maybe that's because the OoO back-end can keep up with the front-end? And reading from the MS-ROM shortens the path even more than reading from the uop cache, making the mispredict penalty that low.

It would be interesting to run some experiments into how much OoO exec is possible around rep movsb, e.g. with two chains of dependent imul instructions, to see if it (partially) serializes them like lfence. We hope not, but to achieve ILP the later imul uops would have to issue without waiting for the back-end to drain.
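As a rough guide to what such an experiment would be looking for, here is a back-of-the-envelope model in Python. It is only a sketch: it assumes a 3-cycle latency for 64-bit imul (true of Skylake-family cores), ignores the cost of the copy itself, and the function names are made up for illustration.

```python
IMUL_LATENCY = 3  # cycles for a 64-bit `imul r, r` on Skylake-family cores


def chain_latency(n_imuls: int) -> int:
    """Latency of a chain of n dependent imul instructions."""
    return n_imuls * IMUL_LATENCY


def expected_cycles(n_before: int, n_after: int, serialized: bool) -> int:
    """Rough cycle count for two independent imul chains placed before and
    after a rep movsb, ignoring the cost of the copy itself.

    serialized=True models the lfence-like case where uops after the
    rep movsb cannot start until everything before it has finished.
    """
    before, after = chain_latency(n_before), chain_latency(n_after)
    return before + after if serialized else max(before, after)


print(expected_cycles(100, 100, serialized=False))
print(expected_cycles(100, 100, serialized=True))
```

If the measured time per iteration is close to the first number, the two chains overlapped; if it is close to the second, the rep movsb acted as a barrier between them.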

I did some experiments on Skylake (i7-6700k). Preliminary result: copy sizes of 95 bytes and less are cheap and hidden by the latency of the IMUL chains, which basically fully overlap. Copy sizes of 96 bytes or more drain the RS, serializing the two IMUL chains. It doesn't matter whether it's rep movsb with RCX=95 vs. 96 or rep movsd with RCX=23 vs. 24. See the discussion in comments for more summary of my findings; if I find time I'll post more details.

The "drains the RS" behaviour was measured with the rs_events.empty_end:u event becoming 1 per rep movsb instead of ~0.003. other_assists.any:u was zero, so it's not an "assist", or at least not counted as one.

Perhaps whatever uop is involved only detects a mispredict when reaching retirement, if microcode branches don't support fast recovery via the BoB? The 96-byte threshold is probably the cutoff for some alternate strategy. RCX=0 also drains the RS, presumably because it's also a special case.
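The point that the threshold depends on total bytes rather than on the element count can be written down as a few lines of Python. This is just a restatement of the measurements above, not a model of the hardware: the 96-byte cutoff and the RCX=0 case come from one Skylake i7-6700k, and the function names are invented for this sketch.

```python
# Cutoff observed on one Skylake i7-6700k in the measurements above;
# treat it as an assumption for any other CPU.
RS_DRAIN_THRESHOLD_BYTES = 96

ELEMENT_SIZE = {"movsb": 1, "movsw": 2, "movsd": 4, "movsq": 8}


def copy_size_bytes(mnemonic: str, rcx: int) -> int:
    """Total bytes copied by `rep <mnemonic>` with the given RCX count."""
    return rcx * ELEMENT_SIZE[mnemonic]


def expected_to_drain_rs(mnemonic: str, rcx: int) -> bool:
    """True if the copy falls in a range where an RS drain was observed."""
    size = copy_size_bytes(mnemonic, rcx)
    return size == 0 or size >= RS_DRAIN_THRESHOLD_BYTES


for mnemonic, rcx in [("movsb", 95), ("movsb", 96), ("movsd", 23), ("movsd", 24)]:
    size = copy_size_bytes(mnemonic, rcx)
    print(mnemonic, rcx, size, expected_to_drain_rs(mnemonic, rcx))
```

rep movsd with RCX=23 copies 92 bytes and RCX=24 copies 96 bytes, so both instructions cross the same byte threshold.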

It would be interesting to test with rep scas (which doesn't have fast-strings support, and is just slow and dumb microcode).

Intel's 1994 Fast Strings patent describes the implementation in P6. P6 doesn't have an IDQ (so it makes sense that modern CPUs, which do have buffers between stages and a uop cache, will have some changes), but the mechanism the patent describes for avoiding branches is neat and maybe still used for modern ERMSB: the first n copy iterations are predicated uops for the back-end, so they can be issued unconditionally. There's also a uop that causes the back-end to send its ECX value to the microcode sequencer, which uses that to feed in exactly the right number of extra copy iterations after that. These are just the copy uops (and maybe updates of ESI, EDI, and ECX, or maybe that is only done on an interrupt or exception), not microcode-branch uops.
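To make the branch-free scheme concrete, here is a toy Python model of it as I read the description above. It is a sketch, not the patent's actual design: the default of 8 predicated iterations is a made-up number (the value of n isn't given here), and issue_rep_copy is a hypothetical name.

```python
def issue_rep_copy(ecx: int, n_predicated: int = 8) -> dict:
    """Count the copy uops issued for one rep-string instruction.

    n_predicated is a hypothetical value; the description above does not
    say how many predicated iterations real hardware issues.
    """
    # Step 1: issue n predicated copy uops unconditionally. Iteration i
    # only takes effect if i < ecx; otherwise it behaves as a no-op.
    predicated = [i < ecx for i in range(n_predicated)]

    # Step 2: the back-end reports ECX to the microcode sequencer, which
    # then issues exactly the remaining iterations, with no branch uops.
    extra = max(0, ecx - n_predicated)

    return {
        "issued": n_predicated + extra,
        "effective": sum(predicated) + extra,
        "cancelled": predicated.count(False),
        "extra_after_feedback": extra,
    }


print(issue_rep_copy(3))
print(issue_rep_copy(20))
```

For any count, the number of effective iterations equals ECX, and nothing has to be predicted: short copies waste a few cancelled uops, long copies wait for the ECX feedback.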

This initial n uops vs. feeding in more after reading RCX could be the 96-byte threshold I was seeing; it came with an extra idq.ms_switches:u per rep movsb (up from 4 to 5).

https://eprint.iacr.org/2016/086.pdf suggests that microcode can trigger an assist in some cases, which might be the modern mechanism for larger copy sizes and would explain draining the RS (and apparently the ROB), because it only triggers when the uop is committed (retired), so it's like a branch without fast recovery.


"The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler."

The difference between this and the P6 patent is that this assist request can happen after some non-microcode uops from later instructions have already been issued, in anticipation of the microcoded instruction being complete with only the first batch of uops. Or if it's not the last uop in a batch from microcode, it could be used like a branch for picking a different strategy.

But that's why it has to flush the ROB.

My impression of the P6 patent is that the feedback to the MS happens before issuing uops from later instructions, in time for more MS uops to be issued if needed. If I'm wrong, then maybe it was already the same mechanism that the 2016 paper still describes.


"Usually, when a branch mispredicts as being taken then when the instruction retires"

Intel since Nehalem has had "fast recovery", starting recovery when a mispredicted branch executes, not waiting for it to reach retirement like an exception.
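A toy Python model shows why the distinction matters. It assumes strict in-order retirement and uses made-up completion times; recovery_start and retire_cycles are names invented for this sketch, not anything from Intel documentation.

```python
from itertools import accumulate


def retire_cycles(complete_cycles):
    """In-order retirement: a uop can retire only once it and every older
    uop have finished executing."""
    return list(accumulate(complete_cycles, max))


def recovery_start(complete_cycles, branch_index, fast_recovery):
    """Cycle at which recovery from a mispredict at branch_index can begin."""
    if fast_recovery:
        # Recovery starts as soon as the branch itself executes.
        return complete_cycles[branch_index]
    # Otherwise wait until the branch reaches retirement, like an exception.
    return retire_cycles(complete_cycles)[branch_index]


# Hypothetical completion times in program order: a cheap ALU op, a load
# that misses in cache, then the mispredicted branch.
complete = [5, 200, 12]
print(recovery_start(complete, 2, fast_recovery=True))
print(recovery_start(complete, 2, fast_recovery=False))
```

With the older cache-miss load still in flight, recovery at execution can begin at cycle 12, while recovery at retirement has to wait until cycle 200.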

This is the point of having a Branch-Order-Buffer on top of the usual ROB retirement state that lets you roll back when any other type of unexpected event becomes non-speculative. (See "What exactly happens when a Skylake CPU mispredicts a branch?")

Footnote 1: IceLake is supposed to have the "fast short rep" feature, which might be a different mechanism for handling rep strings, rather than a change to microcode, e.g. maybe a HW state machine like the one Andy mentions he wished he'd designed in the first place.

I don't have any info on performance characteristics, but once we know something we might be able to make some guesses about the new implementation.
