Could a processor be made that supports multiple ISAs? (ex: ARM + x86)

Question

Intel has been internally decoding CISC instructions to RISC instructions since their Skylake(?) architecture, and AMD has been doing so since their K5 processors. So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution? If that is what is happening, then I wonder if it's possible to create a processor that understands (i.e., internally translates to its own proprietary instructions) both x86 and ARM instructions. If that is possible, what would the performance be like? And why hasn't it been done already?

Answer

The more different the ISAs, the harder it would be. And the more overhead it would cost, especially the back-end. It's not as easy as slapping a different front-end onto a common back-end microarchitecture design.

If it was just a die area cost for different decoders, not other power or perf differences, that would be minor and totally viable these days, with large transistor budgets. (Taking up space in a critical part of the chip that places important things farther from each other is still a cost, but that's unlikely to be a problem in the front-end). Clock or even power gating could fully power down whichever decoder wasn't being used. But as I said, it's not that simple because the back-end has to be designed to support the ISA's instructions and other rules / features; CPUs don't decode to a fully generic / neutral RISC back-end. Related: Why does Intel hide internal RISC core in their processors? has some thoughts and info about what the internal RISC-like uops are like in modern Intel designs.

Adding ARM support capability to Skylake for example would make it slower and less power-efficient when running pure x86 code, as well as cost more die area. That's not worth it commercially, given the limited market for it, and the need for special OS or hypervisor software to even take advantage of it. (Although that might start to change with AArch64 becoming more relevant thanks to Apple.)

A CPU that could run both ARM and x86 code would be significantly worse at either one than a pure design that only handles one.

  • efficiently running 32-bit ARM requires support for fully predicated execution, including fault suppression for loads / stores. (Unlike AArch64 or x86, which only have ALU-select type instructions like csinc vs. cmov / setcc that just have a normal data dependency on FLAGS as well as their other inputs.)
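To make the predication point concrete, here's a minimal C sketch (the function name is hypothetical, and the asm in the comments is illustrative): a 32-bit ARM compiler can turn this conditional load into a predicated `ldrne`, which must suppress any fault when the condition is false; x86's `cmov` always reads its memory source regardless of the condition, so an x86 compiler has to keep a real branch when the pointer may be invalid.

```c
#include <assert.h>
#include <stddef.h>

// Illustrative only: why predicated execution needs fault suppression.
// 32-bit ARM can compile the body to:   cmp r0, #0 ; ldrne r0, [r1]
// The predicated LDRNE must NOT fault when cond is false, even if p is bad.
// x86 cmov always loads its memory operand, so this must stay a branch there.
int maybe_load(int cond, const int *p, int fallback) {
    if (cond)
        return *p;   // only reached when cond != 0; p may be NULL otherwise
    return fallback;
}
```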

ARM and AArch64 (especially SIMD shuffles) have several instructions that produce 2 outputs, while almost all x86 instructions only write one output register. So x86 microarchitectures are built to track uops that read up to 3 inputs (2 before Haswell/Broadwell), and write only 1 output (or 1 reg + EFLAGS).
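A concrete two-output case (C sketch, function name hypothetical): a 32x32-&gt;64 widening multiply is a single AArch32 `umull` writing two destination registers (RdLo, RdHi). x86's widening `mul` writing EDX:EAX is one of the rare exceptions behind "almost all" above, and such instructions need special handling in a track-one-output uop scheme.

```c
#include <assert.h>
#include <stdint.h>

// AArch32:  umull rlo, rhi, ra, rb  -- ONE instruction, TWO output registers.
// x86:      mul  esi                -- writes EDX:EAX, one of x86's rare
//                                      2-output instructions.
uint64_t widening_mul(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```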

x86 requires tracking the separate components of a CISC instruction, e.g. the load and the ALU uops for a memory source operand, or the load, ALU, and store for a memory destination.
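For example (function name hypothetical), a memory-destination add is a single x86 instruction but several tracked uops, while a load/store ISA like AArch64 spells the components out as separate instructions anyway:

```c
#include <assert.h>

// x86-64 compiles this to ONE instruction:  add dword ptr [rdi], esi
// which the back-end tracks as load + ALU + store uops (the store itself
// splitting into store-address and store-data on Intel).
// AArch64 instead emits the components as separate instructions: ldr/add/str.
void add_to_mem(int *p, int x) {
    *p += x;
}
```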

x86 requires coherent instruction caches, and snooping for stores that modify instructions already fetched and in flight in the pipeline, or some way to handle at least x86's strong self-modifying-code ISA guarantees (Observing stale instruction fetching on x86 with self-modifying code).

x86 requires a strongly-ordered memory model. (program order + store buffer with store-forwarding). You have to bake this in to your load and store buffers, so I expect that even when running ARM code, such a CPU would basically still use x86's far stronger memory model. (Modern Intel CPUs speculatively load early and do a memory order machine clear on mis-speculation, so maybe you could let that happen and simply not do those pipeline nukes. Except in cases where it was due to mis-predicting whether a load was reloading a recent store by this thread or not; that of course still has to be handled correctly.)

A pure ARM could have simpler load / store buffers that didn't interact with each other as much. (Except for the purpose of making stlr / ldapr / ldar release / acquire / acquire-seq-cst cheaper, not just fully stalling.)
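The practical difference shows up in how ordinary release/acquire code compiles. A small C11 sketch (function names hypothetical): under x86-TSO, both atomics below compile to plain `mov`s, because every x86 store already has release semantics and every load acquire semantics; on weakly-ordered AArch64 they become `stlr` and `ldar`/`ldapr`.

```c
#include <stdatomic.h>

static int payload;
static atomic_int ready;

// Producer side: on x86 both stores are plain MOVs (ordering is free under
// TSO); on AArch64 the release store compiles to stlr.
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

// Consumer side: the acquire load is a plain MOV on x86, ldar/ldapr on AArch64.
int consume(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;            // spin until the flag is published
    return payload;  // guaranteed to observe the store made before `ready`
}
```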

Different page-table formats. (You'd probably pick one or the other for the OS to use, and only support the other ISA for user-space under a native kernel.)

If you did try to fully handle privileged / kernel stuff from both ISAs, e.g. so you could have HW virtualization with VMs of either ISA, you also have stuff like control-register and debug facilities.

Update: Apple M1 does support a strong x86-style TSO memory model, allowing efficient+correct binary translation of x86-64 machine code into AArch64 machine code, without needing to use ldapr / stlr for every load and store. It also has a weak mode for running native AArch64 code, toggleable by the kernel.

In Apple's Rosetta binary translation, software handles all the other issues I mentioned; the CPU is just executing native AArch64 machine code. (And Rosetta only handles user-space programs, so there's no need to even emulate x86 page-table formats and semantics like that.)

This already exists for other combinations of ISAs, notably AArch64 + ARM, but also x86-64 and 32-bit x86, which have slightly different machine-code formats and a larger register set. Those pairs of ISAs were of course designed to be compatible, and for kernels for the new ISA to have support for running the older ISA as user-space processes.

At the easiest end of the spectrum, we have x86-64 CPUs which support running 32-bit x86 machine code (in "compat mode") under a 64-bit kernel. They use the same fetch/decode/issue/out-of-order-exec pipeline for all modes. 64-bit x86 machine code is intentionally similar enough to 16- and 32-bit modes that the same decoders can be used, with only a few mode-dependent decoding differences. (Like inc/dec vs. REX prefix.) AMD was intentionally very conservative, unfortunately, leaving many minor x86 warts unchanged for 64-bit mode, to keep decoders as similar as possible. (Perhaps in case AMD64 didn't catch on, they didn't want to be stuck spending extra transistors on features people wouldn't use.)

AArch64 and ARM 32-bit are separate machine-code formats with significant differences in encoding. e.g. immediate operands are encoded differently, and I assume most of the opcodes are different. Presumably pipelines have 2 separate decoder blocks, and the front-end routes the instruction stream through one or the other depending on mode. Both are relatively easy to decode, unlike x86, so this is presumably fine; neither block has to be huge to turn instructions into a consistent internal format. Supporting 32-bit ARM does mean somehow implementing efficient support for predication throughout the pipeline, though.

Early Itanium (IA-64) also had hardware support for x86, defining how the x86 register state mapped onto the IA-64 register state. Those ISAs are completely different. My understanding was that x86 support was more or less "bolted on", with a separate area of the chip dedicated to running x86 machine code. Performance was bad, worse than good software emulation, so once that was ready the HW designs dropped it. (https://en.wikipedia.org/wiki/IA-64#Architectural_changes)

So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution?

Yes, but that "RISC ISA" is not similar to ARM. e.g. it has all the quirks of x86, like shifts leaving FLAGS unmodified if the shift count is 0. (Modern Intel handles that by decoding shl eax, cl to 3 uops; Nehalem and earlier stalled the front-end if a later instruction wanted to read FLAGS from a shift.)
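The quirk comes from x86's count masking: the hardware masks a 32-bit shift count to 5 bits, so a count of 32 acts as a count of 0 and must leave FLAGS untouched, which is why the uop for `shl reg, cl` has to treat FLAGS as an input as well as an output. A C sketch of the value semantics only (the FLAGS behavior itself can't be expressed in C; function name hypothetical):

```c
#include <assert.h>
#include <stdint.h>

// x86 masks the count for 32-bit shifts to 5 bits (count & 31), so a
// count of 32 shifts by 0: the value is unchanged and, at the ISA level,
// FLAGS must also be left unmodified in that case.
uint32_t x86_shl32(uint32_t x, uint32_t count) {
    return x << (count & 31);
}
```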

Probably a better example of a back-end quirk that needs to be supported is x86 partial registers, like writing AL and AH, then reading EAX. The RAT (register allocation table) in the back-end has to track all that, and issue merging uops or however it handles it. (See Why doesn't GCC use partial registers?).
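A C model of the aliasing the RAT has to get right (function name hypothetical): after separate writes to AL and AH, a read of EAX must see both byte writes merged with the unchanged upper half of the register.

```c
#include <assert.h>
#include <stdint.h>

// mov al, X ; mov ah, Y ; then read eax:
// the back-end must merge both partial-register writes into the full value.
uint32_t write_al_ah_read_eax(uint32_t eax, uint8_t al, uint8_t ah) {
    eax = (eax & 0xFFFFFF00u) | al;                   // mov al, X
    eax = (eax & 0xFFFF00FFu) | ((uint32_t)ah << 8);  // mov ah, Y
    return eax;                                       // read eax -> merge
}
```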
