How are microcodes executed during an instruction cycle?

Question

From open resources I can conclude that microcode is, roughly, something that can be executed directly by the CPU and is responsible for implementing instruction codes. Also, Wikipedia indicates that every execution of an instruction goes through a fetch-decode-execute instruction cycle. However, I cannot find any reference explaining how microcode execution happens during this three-phase cycle. So my question is: what is the relationship between microcode execution and the instruction cycle? How does microcode do its work during the fetch, decode and execute phases of an instruction's execution?

Also, this Stack Overflow answer says that in modern Intel CPUs even the simplest instructions like DIV and MOV are compiled into microcode before executing, so it would be best if someone could explain it with examples from such CPUs, if that is indeed true.

Answer

div is not simple; it's one of the hardest integer operations to compute! It's microcoded on Intel CPUs, unlike mov, add/sub, or even imul, which are all single-uop on modern Intel. See https://agner.org/optimize/ for instruction tables and microarchitecture guides. (Fun fact: AMD Ryzen doesn't microcode div; it's only 2 uops because it has to write 2 output registers. Piledriver and later also make 32-bit and 64-bit division 2 uops.)

All instructions decode to 1 or more uops (with most instructions in most programs being 1 uop on current CPUs). Instructions which decode to 4 or fewer uops on Intel CPUs are described as "not microcoded", because they don't use the special MSROM mechanism for many-uop instructions.
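As a rough illustration, here are approximate fused-domain uop counts for a few instructions, taken from Agner Fog's Skylake tables; treat the exact numbers as approximations, since they differ between microarchitectures:

    ; Approximate Skylake fused-domain uop counts (Agner Fog's tables); illustrative only.
    mov     eax, ebx        ; 1 uop  (register copy; may even be eliminated at rename)
    add     eax, ecx        ; 1 uop
    imul    eax, ecx        ; 1 uop  (even integer multiply is single-uop)
    xchg    eax, ecx        ; 3 uops (4 or fewer, so handled by the normal decoders)
    div     ecx             ; ~10 uops (more than 4, so it comes from the MS-ROM: microcoded)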

No CPUs that decode x86 instructions to uops use a simple 3-phase fetch/decode/exec cycle, so that part of the premise of your question makes no sense. Again, see Agner Fog's microarch guide.

Are you sure you wanted to ask about modern Intel CPUs? Some older CPUs are internally microcoded, especially non-pipelined CPUs where the process of executing different instructions can activate different internal logic blocks in a different order. The logic that controls this is also called microcode, but it's a different kind of microcode from the modern meaning of the term in the context of a pipelined out-of-order CPU.

If that's what you're looking for, see How was microcode implemented in retro processors? on retrocomputing.SE for non-pipelined CPUs like 6502 and Z80, where some of the microcode internal timing cycles are documented.

When a microcoded "indirect uop" reaches the head of the IDQ in a Sandybridge-family CPU, it takes over the issue/rename stage and feeds it uops from the microcode-sequencer MS-ROM until the instruction has issued all its uops; then the front-end can resume issuing other uops into the out-of-order back-end.

The IDQ is the Instruction Decode Queue that feeds the issue/rename stage (which sends uops from the front-end into the out-of-order back-end). It buffers uops that come from the uop cache + legacy decoders, to absorb bubbles and bursts. It's the 56-uop queue in David Kanter's Haswell block diagram. (But that shows microcode only being read before the queue, which doesn't match Intel's description of some perf events¹, or what has to happen for microcoded instructions that run a data-dependent number of uops.)

(This might not be 100% accurate, but at least works as a mental model for most of the performance implications². There might be other explanations for the performance effects we've observed so far.)

This only happens for instructions that need more than 4 uops; instructions that need 4 or fewer decode to separate uops in the normal decoders and can issue normally. e.g. xchg eax, ecx is 3 uops on modern Intel: Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? goes into detail about what we can figure out about what those uops actually are.
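For intuition only: the architectural effect of xchg eax, ecx is the same as three dependent register copies through a temporary. The sketch below is an assumption about what those uops accomplish, not a claim about the literal internal uops, which use a hidden temporary rather than an architectural register like edx:

    ; Architectural equivalent of xchg eax, ecx using a scratch register.
    ; The real 3 uops use an internal temporary, not edx.
    mov     edx, eax
    mov     eax, ecx
    mov     ecx, edx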

The special "indirect" uop for a microcoded instruction takes a whole line to itself in the decoded-uop cache, the DSB (potentially causing code-alignment performance issues). I'm not sure if they only take 1 entry in the queue that feeds the issue stage from the uop cache and/or legacy decoders (the IDQ). Anyway, I made up the term "indirect uop" to describe it. It's really more like a not-yet-decoded instruction or a pointer into the MS-ROM. (Possibly some microcoded instructions might be a couple of "normal" uops and one microcode pointer; that could explain it taking a whole uop-cache line to itself.)

I'm pretty sure they don't fully expand until they reach the head of the queue, because some microcoded instructions are a variable number of uops depending on data in registers. Notably rep movs, which basically implements memcpy. In fact this is tricky; with different strategies depending on alignment and size, rep movs actually needs to do some conditional branching. But it's jumping to different MS-ROM locations, not to different x86 machine-code locations (RIP values). See Conditional jump instructions in MSROM procedures?.

Intel's fast-strings patent also sheds some light on the original implementation in P6: the first n copy iterations are predicated in the back-end, which gives the back-end time to send the value of ECX to the MS. From that, the microcode sequencer can send exactly the right number of copy uops if more are needed, with no branching needed in the back-end. Maybe the mechanism for handling nearly-overlapping src and dst or other special cases isn't based on branching after all, but Andy Glew did mention lack of microcode branch prediction as an issue for the implementation. So we know they are special. And that was back in P6 days; rep movsb is more complicated now.

Depending on the instruction, it might or might not drain the out-of-order back end's reservation station aka scheduler while sorting out what to do. rep movs does that for copies > 96 bytes on Skylake, unfortunately (according to my testing with perf counters, putting rep movs between independent chains of imul). This might be due to mispredicted microcode branches, which aren't like regular branches. Maybe branch-miss fast-recovery doesn't work on them, so they aren't detected / handled until they reach retirement? (See the microcode branch Q&A for more about this).
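A minimal NASM sketch of that kind of experiment, under assumptions: the buffer size, chain lengths, and repeat count below are arbitrary choices, and the counting is left to an external tool such as perf stat -e cycles,uops_issued.any. If rep movsb drains the scheduler, the second imul chain can't overlap with the copy and total cycles go up.

    ; Sketch: time two independent imul dependency chains with a >96-byte
    ; rep movsb between them (x86-64 Linux, assemble with NASM, link with ld).
    default rel
    global _start

    section .bss
    src:    resb 4096
    dst:    resb 4096

    section .text
    _start:
            mov     ebp, 100000         ; repeat the measured region many times
    .loop:
            times 32 imul rax, rax      ; dependency chain 1

            lea     rsi, [src]
            lea     rdi, [dst]
            mov     ecx, 4096           ; copy size well above the 96-byte threshold
            rep movsb                   ; microcoded instruction under test

            times 32 imul rbx, rbx      ; dependency chain 2: can it overlap the copy?

            dec     ebp
            jnz     .loop

            mov     eax, 60             ; exit(0)
            xor     edi, edi
            syscall

Running it again with the rep movsb removed (or the copy size dropped below 96 bytes) gives a baseline for how much the two chains normally overlap.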

rep movsmov 非常不同.普通mov(如mov eax, [rdi + rcx*4])即使在复杂的寻址模式下也是单个uop. mov存储区是1个微融合的uop,包括可以按任意顺序执行的存储地址和存储数据uop,将数据和物理地址写入存储缓冲区,以便存储区可以在指令后提交到L1d从乱序的后端退休,并且变得没有投机性. rep movs的微码将包含许多加载和存储uops.

rep movs is very different from mov. Normal mov like mov eax, [rdi + rcx*4] is a single uop even with a complex addressing mode. A mov store is 1 micro-fused uop, including both a store-address and store-data uop that can execute in either order, writing the data and physical address into the store buffer so the store can commit to L1d after the instruction retires from the out-of-order back-end and becomes non-speculative. The microcode for rep movs will include many load and store uops.
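To make that concrete, the following is not the actual microcode (the real MS-ROM code uses wider, alignment-dependent copies and its own branching); it's just plain x86 with the same architectural effect as rep movsb, showing why the expansion amounts to a load uop plus store-address/store-data uops per chunk:

    ; Byte copy with the same architectural semantics as rep movsb (DF=0):
    ; copy RCX bytes from [RSI] to [RDI]. Clobbers AL as a scratch temporary.
    section .text
    global copy_bytes
    copy_bytes:
            test    rcx, rcx
            jz      .done
    .next:
            mov     al, [rsi]       ; load uop
            mov     [rdi], al       ; store (store-address + store-data uops, micro-fused)
            inc     rsi
            inc     rdi
            dec     rcx
            jnz     .next
    .done:
            ret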

Footnote 1:

We know there are perf events like idq.ms_dsb_cycles on Skylake:

[Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser[sic] (MS) is busy]

That would make no sense if microcode is just a 3rd possible source of uops to feed into the front of the IDQ. But then there's an event whose description sounds like that:

idq.ms_switches
[Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer]

I think this actually means it counts when the issue/rename stage switches to taking uops from the microcode sequencer instead of the IDQ (which holds uops from DSB and/or MITE). Not that the IDQ switches its source of incoming uops.

Footnote 2:

To test this theory, we could construct a test case with lots of easily-predicted jumps to cold i-cache lines after a microcoded instruction, and see how far the front-end gets in following cache misses and queueing up uops into the IDQ and other internal buffers during the execution of a big rep scasb.

SCASB doesn't have fast-strings support, so it's very slow and doesn't touch a huge amount of memory per cycle. We want it to hit in L1d so timing is highly predictable. Probably a couple of 4k pages are enough time for the front-end to follow a lot of i-cache misses. We can even map contiguous virtual pages to the same physical page (e.g. from user-space with mmap on a file).

If the IDQ space behind the microcoded instruction can be filled up with later instructions while it's executing, that leaves more room for the front-end to fetch from more i-cache lines ahead of when they're needed. We can then hopefully detect the difference with total cycles and/or other perf counters, for running rep scasb plus a sequence of jumps. Before each test, use clflushopt on the lines holding the jump instructions.
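One way that test could be laid out, as a NASM sketch: the 4 KiB buffer, the 64-line jump chain, and the flush pass are assumptions about a reasonable setup rather than a worked-out benchmark, and the interesting comparison is total cycles and the idq.ms_* counters across runs with and without the flush.

    ; Sketch: a long repe scasb that hits in L1d, followed by a chain of
    ; easily-predicted direct jumps, each landing in a different (flushed)
    ; i-cache line. Requires CLFLUSHOPT support. x86-64 Linux, NASM.
    default rel
    global _start

    section .bss
    buf:    resb 4096                   ; zeroed by the OS, stays hot in L1d

    section .text
    _start:
            lea     rax, [jump_chain]   ; flush the 64 lines holding the jumps
            mov     ecx, 64
    .flush: clflushopt [rax]
            add     rax, 64
            dec     ecx
            jnz     .flush
            mfence

            lea     rdi, [buf]          ; microcoded instruction under test:
            mov     ecx, 4096           ; scan all 4096 zero bytes (AL = 0,
            xor     eax, eax            ; so ZF stays set and repe runs to the end)
            repe scasb

            jmp     jump_chain          ; now run the chain of cold jumps

    align 64
    jump_chain:
    %rep 64
            jmp     $ + 64              ; each jump targets the start of the next line
            align 64
    %endrep
            mov     eax, 60             ; exit(0)
            xor     edi, edi
            syscall

If the front-end can keep fetching behind the repe scasb, the cold jump-chain lines should already be in flight (or their uops queued) by the time the scan finishes, and total cycles should barely change compared to a run without the flush pass.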

To test rep movs this way, we could maybe play tricks with virtual memory to get contiguous pages mapped to the same physical page, again giving us L1d hits for loads + stores, but dTLB delays would be hard to control. Or even boot with the CPU in no-fill mode, but that's very hard to use and would need a custom "kernel" to put the result somewhere visible.

I'm pretty confident we would find uops entering the IDQ while a microcoded instruction has taken over the front-end (if it wasn't already full). There is a perf event

idq.ms_uops
[Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy]

and 2 other events like that which count only uops coming from MITE (legacy decode) or uops coming from DSB (uop cache). Intel's description of those events is compatible with my description of how a microcoded instruction ("indirect uop") takes over the issue stage to read uops from the microcode sequencer / ROM while the rest of the front-end continues doing its thing delivering uops to the other end of the IDQ until it fills up.
