If registers are so blazingly fast, why don't we have more of them?

Question

In 32-bit, we had 8 "general purpose" registers. With 64-bit, the number doubled, but that seems independent of the 64-bit change itself.
Now, if registers are so fast (no memory access), why aren't there naturally more of them? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction that leaves us with only the number we have?

Recommended answer

There are many reasons you don't just have a huge number of registers:

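  • They're tightly linked to most pipeline stages. For starters, you need to track their lifetime and forward results back to previous stages. The complexity becomes intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive in area, which ultimately means that past a certain point it's expensive in power, price and performance.
  • They take up instruction encoding space. 16 registers take up 4 bits each for source and destination, and another 4 if you have 3-operand instructions (e.g. ARM). That's an awful lot of instruction set encoding space used just to specify registers. This eventually impacts decoding, code size and, again, complexity.
  • There are better ways to achieve the same result...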

These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g. 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:

ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]

Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.

ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
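
The renaming step itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rename`, the tuple format for instructions and the physical register names `p0`, `p1`, ... are all made up here, and the sketch never returns registers to the free list, which real hardware does once a value is no longer needed. Unlike the simplified v0/v1 listing above, it hands out a fresh physical register on every write, which is how renaming is usually described.

```python
# Minimal register-renaming sketch (illustrative only, not a model of any
# specific core). Each instruction is (opcode, destination, sources);
# a store has no destination register.

def rename(program, num_physical=8):
    """Rewrite architectural register names (r0, r4, ...) to physical ones."""
    free = [f"p{i}" for i in range(num_physical)]  # free list of physical registers
    mapping = {}                                   # architectural -> current physical
    renamed = []

    def read(reg):
        # A source reads whichever physical register currently holds the value.
        if reg not in mapping:
            mapping[reg] = free.pop(0)  # live-in value: give it a home on first use
        return mapping[reg]

    for op, dest, srcs in program:
        phys_srcs = [read(s) for s in srcs]  # rename sources before the destination
        phys_dest = None
        if dest is not None:
            mapping[dest] = free.pop(0)      # every write gets a fresh physical register
            phys_dest = mapping[dest]
        renamed.append((op, phys_dest, phys_srcs))
    return renamed


# The two load/increment/store sequences from the ARM example.
program = [
    ("ldr", "r0", ["r4"]),
    ("add", "r0", ["r0"]),
    ("str", None, ["r0", "r4"]),
    ("ldr", "r0", ["r5"]),
    ("add", "r0", ["r0"]),
    ("str", None, ["r0", "r5"]),
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, the first three instructions and the last three share no physical register, so nothing forces the second sequence to wait for the first.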

I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). Exposing all of them architecturally would mean spending 8 bits times 2 in every instruction just to say what the source and destination are. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled on, and for out-of-order CPU designs, register renaming is the way to mitigate the limit.
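
The encoding arithmetic is easy to check. The snippet below is a small sketch, with a made-up helper name `register_field_bits`, that counts only the bits spent naming registers and ignores everything else an instruction has to encode.

```python
# Bits spent on register specifiers alone, for a few register-file sizes.
# The helper name and the table layout are illustrative.

def register_field_bits(num_registers, operands):
    """Bits needed to name `operands` registers out of `num_registers`."""
    bits_per_operand = (num_registers - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return bits_per_operand * operands


for n in (8, 16, 32, 256):
    two = register_field_bits(n, 2)
    three = register_field_bits(n, 3)
    print(f"{n:>3} registers: {two:>2} bits (2 operands), {three:>2} bits (3 operands)")
```

With 16 registers this gives 8 bits for two operands and 12 for three, matching the figures above. With 256 registers it gives 16 and 24 bits, which would leave very little of a fixed 32-bit instruction word for the opcode and immediates.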

Edit: A note on the importance of out-of-order (OOO) execution and register renaming here. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limited 8 registers mean a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance trade-off.
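
To put rough numbers on why OOO and renaming go together, the sketch below replays the earlier ARM sequence through a toy timing model. Everything in it is an assumption made for illustration: the latencies are invented cycle counts, the machine can start any number of instructions at once, and r4 and r5 are assumed to point at different addresses so the store and the second load never conflict in memory. It is not a model of any real core.

```python
# Toy timing model: an instruction starts once everything it must wait for
# has finished. Latencies are made-up cycle counts for illustration.

MISS, HIT, ALU, STORE = 100, 2, 1, 1

# (text, destination, sources, latency)
program = [
    ("ldr r0, [r4]",   "r0", ["r4"],       MISS),   # first load misses the cache
    ("add r0, r0, #1", "r0", ["r0"],       ALU),
    ("str r0, [r4]",   None, ["r0", "r4"], STORE),
    ("ldr r0, [r5]",   "r0", ["r5"],       HIT),    # second load hits
    ("add r0, r0, #1", "r0", ["r0"],       ALU),
    ("str r0, [r5]",   None, ["r0", "r5"], STORE),
]


def finish_times(program, renaming):
    """Return the cycle at which each instruction finishes."""
    last_write = {}  # register -> finish time of its latest writer
    last_read = {}   # register -> latest finish time among its readers
    finishes = []
    for _, dest, srcs, latency in program:
        # True (read-after-write) dependences always apply.
        start = max([last_write.get(s, 0) for s in srcs], default=0)
        if dest is not None and not renaming:
            # Without renaming, reusing a register name also has to wait for
            # earlier writers (WAW) and earlier readers (WAR) of that name.
            start = max(start, last_write.get(dest, 0), last_read.get(dest, 0))
        end = start + latency
        for s in srcs:
            last_read[s] = max(last_read.get(s, 0), end)
        if dest is not None:
            last_write[dest] = end
        finishes.append(end)
    return finishes


with_renaming = finish_times(program, renaming=True)
without_renaming = finish_times(program, renaming=False)
print("second store finishes at cycle", with_renaming[-1], "with renaming")
print("second store finishes at cycle", without_renaming[-1], "without renaming")
```

With these invented latencies the second store finishes at cycle 4 when r0 is renamed and at cycle 106 when it is not, because without renaming the second load has to wait until the first sequence has finished with r0.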

So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.

None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.
