Why not store function parameters in XMM vector registers?

Question

I'm currently reading the book "Computer Systems: A Programmer's Perspective". I've found out that, on the x86-64 architecture, only the first 6 integer parameters are passed to a function in registers. Any further parameters are passed on the stack.

Also, up to the first 8 FP or vector args are passed in xmm0..7.
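To make the two rules above concrete, here is a minimal sketch in C. The function `sum_args` and its parameter names are hypothetical, and the register assignments in the comment are what the x86-64 System V convention is expected to produce; a given compiler's output should be checked in the generated assembly rather than taken from this comment.

```c
#include <assert.h>

/* Hypothetical function with more integer args than integer arg registers.
 * Expected placement under the x86-64 System V calling convention:
 *   a..f -> rdi, rsi, rdx, rcx, r8, r9   (the 6 integer argument registers)
 *   x, y -> xmm0, xmm1                    (FP args are counted separately)
 *   g, h -> stack                         (7th and 8th integer args)
 */
long sum_args(long a, long b, long c, long d, long e, long f,
              double x, double y, long g, long h)
{
    return a + b + c + d + e + f + g + h + (long)(x + y);
}

/* Small self-check: the result does not depend on where each argument
 * was passed, only the generated code does. */
void check_sum_args(void)
{
    assert(sum_args(1, 2, 3, 4, 5, 6, 0.5, 1.5, 7, 8) == 38);
}
```

Note that the two `double` arguments do not use up integer argument slots, which is why `g` and `h` are the ones that spill to the stack.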

Why not use the floating-point registers to hold the remaining parameters, even when those parameters are not single/double precision variables?

It would be much more efficient (as far as I understand) to keep the data in registers than to store it to memory and then read it back from memory.

Recommended answer

Most functions don't have more than 6 integer parameters, so this is really a corner case. Passing some excess integer params in xmm registers would make the rules for where to find floating point args more complicated, for little to no benefit. Besides, it probably wouldn't make code any faster.

A further reason for storing excess parameters in memory is that the function probably won't use them all right away. If you want to call another function, you have to save those parameters from xmm registers to memory, because the function you call will destroy any parameter-passing registers. (And all the xmm regs are caller-saved anyway.) So you could potentially end up with code that stuffs parameters into vector registers where they can't be used directly, then stores them to memory before calling another function, and only then loads them back into integer registers. Or even if the function doesn't call other functions, maybe it needs the vector registers for its own use, and would have to store params to memory to free them up for running vector code! It would have been easier just to push params onto the stack, because push is very heavily optimized, for obvious reasons, to do the store and the modification of RSP all in a single uop, about as cheap as a mov.
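The "not used right away" case can be sketched as follows. Both functions are hypothetical and exist only for illustration; the comments describe what a compiler would likely do under the x86-64 System V convention, not a guaranteed code-generation result.

```c
#include <assert.h>

/* Hypothetical callee. Any call may clobber every argument-passing
 * register, and all xmm registers, in this calling convention. */
long helper(long v)
{
    return v * 2;
}

/* g and h are the 7th and 8th integer args, so they arrive on the stack.
 * They are not needed until after the call to helper(), and they can stay
 * where they are in memory until then. Had they arrived in xmm registers,
 * they would likely need to be spilled to memory before the call and
 * reloaded afterwards. */
long use_late(long a, long b, long c, long d, long e, long f,
              long g, long h)
{
    long t = helper(a + b + c + d + e + f);
    return t + g + h;
}
```

For example, `use_late(1, 2, 3, 4, 5, 6, 7, 8)` computes `helper(21) + 7 + 8`; the two stack arguments are only read once the call has returned.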

There is one integer register (r11) that is neither used for parameter passing nor call-preserved in the SysV Linux/Mac x86-64 ABI. It's useful to have a scratch register that lazy dynamic linker code can use without saving it (since such shim functions need to pass on all their args to the dynamically-loaded function), and the same goes for similar wrapper functions.

So AMD64 could have used more integer registers for function parameters, but only at the expense of the number of registers that called functions have to save before using. (Or by dual-purposing r10 for languages that don't use a "static chain" pointer, or something.)

Anyway, passing more parameters in registers isn't always better.

xmm registers can't be used as pointer or index registers, and moving data from the xmm registers back to integer registers could slow down the surrounding code more than loading data that was just stored. (If any execution resource is going to be a bottleneck, rather than cache misses or branch mispredicts, it's more likely going to be ALU execution units, not load/store units. Moving data from xmm to gp registers takes an ALU uop on current Intel and AMD designs.)

L1 cache is really fast, and store->load forwarding makes the total latency for a round trip to memory something like 5 cycles on e.g. Intel Haswell. (The latency of an instruction like inc dword [mem] is 6 cycles, including the one ALU cycle.)

If moving data from xmm to gp registers were all you were going to do (with nothing else to keep the ALU execution units busy), then yes, on Intel CPUs the round trip latency for movd xmm0, eax / movd eax, xmm0 (2 cycles on Intel Haswell) is less than the latency of mov [mem], eax / mov eax, [mem] (5 cycles on Intel Haswell), but integer code usually isn't totally bottlenecked by latency the way FP code often is.
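The two round trips being compared can be written out in C as a rough sketch. The function names `via_xmm` and `via_memory` are hypothetical. `_mm_cvtsi32_si128` and `_mm_cvtsi128_si32` are the SSE2 intrinsics that correspond to the two movd directions; an optimizing compiler is free to fold the pair away entirely, so this shows the data path rather than serving as a benchmark. The non-x86 fallback is only there so the file still compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Round trip through a vector register:
 * movd xmm, r32 followed by movd r32, xmm. */
int32_t via_xmm(int32_t x)
{
#ifdef HAVE_SSE2
    __m128i v = _mm_cvtsi32_si128(x);  /* GP register -> low lane of xmm */
    return _mm_cvtsi128_si32(v);       /* low lane of xmm -> GP register */
#else
    return x;                          /* fallback for non-x86 targets */
#endif
}

/* Round trip through memory: a store followed by a reload of the same
 * location. volatile keeps the compiler from removing the memory access;
 * on the CPU the reload is normally served by store forwarding. */
int32_t via_memory(int32_t x)
{
    volatile int32_t slot = x;
    return slot;
}
```

Both functions return their argument unchanged; the difference is only in which execution resources (ALU ports versus load/store units) the round trip occupies.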

On AMD Bulldozer-family CPUs, where two integer cores share a vector/FP unit, moving data directly between GP regs and vector regs is actually quite slow (8 or 10 cycles one way, or half that on Steamroller). A memory round trip is only 8 cycles.

32-bit code manages to run reasonably well, even though all parameters are passed on the stack and have to be loaded. CPUs are very highly optimized for storing parameters onto the stack and then loading them again, because the crufty old 32-bit ABI is still used for a lot of code, esp. on Windows. (Most Linux systems mostly run 64-bit code, while most Windows desktop systems run a lot of 32-bit code because so many Windows programs are only available as pre-compiled 32-bit binaries.)

See http://agner.org/optimize/ for CPU microarchitecture guides to learn how to figure out how many cycles something will actually take. There are other good links in the x86 wiki, including the x86-64 ABI doc.
