Why not store function parameters in float registers?

Question

I'm currently reading the book "Computer Systems: A Programmer's Perspective". I've found out that on the x86-64 architecture, only the first 6 integer parameters are passed to a function in registers. Any further parameters are passed on the stack.

Why not use the floating-point registers to hold the next parameters, even when those parameters are not single- or double-precision values? As far as I understand, it would be much more efficient to keep the data in registers than to store it to memory and then read it back from memory.

Recommended answer

Most functions don't have more than 6 integer parameters, so this is really a corner case. Passing some excess integer params in xmm registers would make the rules for where to find floating-point args more complicated, for little to no benefit. It also probably wouldn't make code any faster.
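As a rough illustration of how simple the existing rule is, here is a toy Python model of SysV x86-64 argument assignment. It is a sketch under loud assumptions: only scalar integer/pointer and float/double arguments are modelled (no structs, varargs, or x87 long double), and `assign_args` is a hypothetical helper name, not part of any real tool. Integer and floating-point arguments draw from two independent register pools, so the location of the nth FP argument never depends on how many integer arguments came before it.

```python
# Toy model of SysV x86-64 argument assignment, scalar arguments only.
# Assumption: structs, varargs, x87 long double and __int128 are ignored.

INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # integer/pointer args
SSE_REGS = [f"xmm{i}" for i in range(8)]              # float/double args


def assign_args(kinds):
    """Return a register name or stack slot for each argument kind.

    `kinds` is a list of "int" or "fp". `assign_args` is a hypothetical
    helper written for this illustration.
    """
    next_int = 0
    next_sse = 0
    stack_offset = 0  # byte offset inside the stack-argument area
    locations = []
    for kind in kinds:
        if kind not in ("int", "fp"):
            raise ValueError(f"unknown argument kind: {kind!r}")
        if kind == "int" and next_int < len(INT_REGS):
            locations.append(INT_REGS[next_int])
            next_int += 1
        elif kind == "fp" and next_sse < len(SSE_REGS):
            locations.append(SSE_REGS[next_sse])
            next_sse += 1
        else:
            # Register pool for this class is exhausted: use an 8-byte stack slot.
            locations.append(f"stack+{stack_offset}")
            stack_offset += 8
    return locations


# Seven integers, one double, then another integer.
print(assign_args(["int"] * 7 + ["fp", "int"]))
```

In this model the seventh and ninth arguments land on the stack while the double still goes in xmm0. Letting excess integers spill into xmm registers would couple the two pools, which is the extra complexity the answer refers to.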

A further reason for storing excess parameters in memory is that the function probably won't use them all right away. If you want to call another function, you have to save those parameters from xmm registers to memory, because the function you call will destroy any parameter-passing registers. (And all the xmm regs are caller-saved anyway.) So you could potentially end up with code that stuffs parameters into vector registers where they can't be used directly, stores them from there to memory before calling another function, and only then loads them back into integer registers. Even if the function doesn't call other functions, it may need the vector registers for its own use, and would have to store params to memory to free them up for running vector code! It would have been easier just to push the params onto the stack, because push is very heavily optimized, for obvious reasons, to do the store and the modification of RSP in a single uop, about as cheap as a mov.

There is one integer register, r11, that is not used for parameter passing but is also not call-preserved in the SysV Linux/Mac x86-64 ABI. It's useful to have a scratch register that lazy dynamic-linker code can use without saving (since such shim functions need to pass on all their args to the dynamically loaded function), and the same applies to similar wrapper functions.

So AMD64 could have used more integer registers for function parameters, but only by reducing the number of call-preserved registers, the ones that called functions have to save before using. (Or by dual-purposing r10 for languages that don't use a "static chain" pointer, or something.)
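The trade-off can be made concrete by partitioning the 16 general-purpose registers by their role in the SysV x86-64 ABI. This is a small bookkeeping sketch in Python; the variable names are made up for illustration, and the role comments summarise the ABI rather than quote it.

```python
# The 16 x86-64 general-purpose registers, grouped by SysV ABI role.
ALL_GPRS = {
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
}

# Integer/pointer argument registers, in argument order.
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# Registers a callee must leave unchanged on return (rsp is the stack pointer).
CALL_PRESERVED = {"rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"}

# Everything else may be destroyed by a call.
call_clobbered = ALL_GPRS - CALL_PRESERVED

# Call-clobbered registers that do not carry arguments:
#   rax - return value
#   r10 - static chain pointer for languages that need one
#   r11 - pure scratch, free for PLT stubs and similar shims
scratch_non_arg = sorted(call_clobbered - set(ARG_REGS))

print("argument registers:", ARG_REGS)
print("call-preserved:", sorted(CALL_PRESERVED))
print("clobbered, not used for args:", scratch_non_arg)
```

Every register is already spoken for, so a seventh argument register would have to come out of the call-preserved set or take over r10 or r11.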

Anyway, passing more parameters in registers isn't always better.

xmm registers can't be used as pointer or index registers, and moving data from the xmm registers back to integer registers could slow down the surrounding code more than loading data that was just stored. (If any execution resource is going to be a bottleneck, rather than cache misses or branch mispredicts, it's more likely to be the ALU execution units than the load/store units. Moving data from xmm to GP registers takes an ALU uop in Intel's and AMD's current designs.)

L1 cache is really fast, and store->load forwarding makes the total latency for a round trip to memory something like 5 cycles on, e.g., Intel Haswell. (The latency of an instruction like inc dword [mem] is 6 cycles, including the one ALU cycle.)

If moving data from xmm to GP registers was all you were going to do (with nothing else to keep the ALU execution units busy), then yes, on Intel CPUs the round-trip latency for movd xmm0, eax / movd eax, xmm0 (2 cycles on Intel Haswell) is less than the latency of mov [mem], eax / mov eax, [mem] (5 cycles on Intel Haswell). But integer code usually isn't totally bottlenecked by latency the way FP code often is.

On AMD Bulldozer-family CPUs, where two integer cores share a vector/FP unit, moving data directly between GP regs and vector regs is actually quite slow (8 or 10 cycles one way, or half that on Steamroller). A memory round trip is only 8 cycles.
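The latency figures quoted above can be put side by side with a few lines of Python. This is arithmetic on the numbers in the answer, not a measurement. One assumption is labelled in the code: the Bulldozer "8 or 10 cycles one way" figures are treated as the two directions of a round trip. The dictionary keys and the `faster_path` helper are hypothetical names.

```python
# Round-trip latencies in cycles, taken from the figures quoted in the answer.
HASWELL = {
    "via_xmm_round_trip": 2,     # movd xmm0, eax / movd eax, xmm0
    "via_memory_round_trip": 5,  # mov [mem], eax / mov eax, [mem]
}

# Assumption: the two quoted one-way Bulldozer figures (8 and 10 cycles)
# are the two directions, so a GP -> xmm -> GP round trip is their sum.
BULLDOZER_ONE_WAY = (8, 10)
BULLDOZER = {
    "via_xmm_round_trip": sum(BULLDOZER_ONE_WAY),
    "via_memory_round_trip": 8,
}


def faster_path(latencies):
    """Return the key with the lowest round-trip latency."""
    return min(latencies, key=latencies.get)


for name, latencies in (("Haswell", HASWELL), ("Bulldozer", BULLDOZER)):
    print(name, latencies, "->", faster_path(latencies))
```

Even on latency alone, the register path only wins on one of the two microarchitectures, and this comparison ignores the ALU uop pressure the answer describes.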

32-bit code manages to run reasonably well, even though all parameters are passed on the stack and have to be loaded. CPUs are very highly optimized for storing parameters onto the stack and then loading them again, because the crufty old 32-bit ABI is still used for a lot of code, especially on Windows. (Most Linux systems mostly run 64-bit code, while most Windows desktop systems run a lot of 32-bit code because so many Windows programs are only available as pre-compiled 32-bit binaries.)

See http://agner.org/optimize/ for CPU microarchitecture guides to learn how to figure out how many cycles something will actually take. There are other good links in the x86 wiki, including the x86-64 ABI doc linked above.
