MMX Register Speed vs Stack for Unsigned Integer Storage


Problem Description


I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal state of 17 64-bit unsigned integers, but because of the transformations it uses, the best case could be achieved if I had 44 such integers available in registers, plus possibly one scratch register. In such a case, I would be able to do the entire transform in the registers.

But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially better, depending on the answer to this question.

I am thinking of using the MMX registers for fast storage at least, even if I'll need to swap into other registers for computation. But I'm concerned about that being ancient architecture.

Is data transfer between an MMX register and, say, RAX going to be faster than indexing u64s on the stack and accessing them from what's likely to be L1 cache? Or even if so, are there hidden pitfalls besides considerations of speed I should watch for? I am interested in the general case, so even if one was faster than the other on my computer, it might still be inconclusive.

Solution

Using ymm registers as a "memory-like" storage location is not a win for performance, and MMX wouldn't be either. That use-case is about completely avoiding memory accesses which might disturb a micro-benchmark.

Efficient store-forwarding and fast L1d cache hits make spilling to regular memory (the stack) a very good option. x86 allows memory operands, like add eax, [rdi], and modern CPUs can decode that into a single uop.
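
For instance, a minimal sketch of the stack-based style (NASM syntax; the stack layout and lane indices here are assumptions for illustration, not taken from any particular SHA3 implementation):

    ; state lanes spilled to 64-bit slots on the stack (assumed layout)
    sub     rsp, 17*8              ; reserve space for the state lanes
    mov     rax, [rsp + 0*8]       ; load lane 0
    xor     rax, [rsp + 5*8]       ; xor with a memory operand: a single fused-domain uop
    xor     rax, [rsp + 10*8]      ; same again for another lane
    mov     [rsp + 0*8], rax       ; store back; a prompt reload hits store-forwarding (~3-5 cycles)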

With MMX you'd need 2 uops, like movd edx, mm0 / add eax, edx. So that's more uops, and more latency. movd or movq latency to/from MMX or XMM registers is worse than 3 to 5 cycle store-forwarding latency on typical modern CPUs.
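
For comparison, a hedged sketch of the same combine step when one lane is parked in an MMX register (the register assignment is hypothetical): the value has to travel through a scratch general-purpose register first, costing an extra uop plus the transfer latency.

    movq    rdx, mm0               ; MMX -> GP transfer: the extra uop and its latency
    xor     rax, rdx               ; only now can the integer xor happen
    ; versus the single  xor rax, [rsp + 5*8]  when the lane lives on the stack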


But if you don't need to move data back and forth often, you might be able to usefully keep some of your data in MMX / XMM registers and use pxor mm0, mm1 and so on.

If you can schedule your algorithm so you have fewer total instructions / uops from using movd/movq (int<->XMM or int<->MMX) and movq2dq/movdq2q (MMX->XMM / XMM->MMX) instructions instead of stores and memory operands or loads, then it might be a win.
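
A hedged sketch of that style (register assignment again hypothetical): lanes that live in the MMX/XMM register files are combined there directly, and movq2dq/movdq2q move data between the two files without touching memory at all.

    pxor    mm0, mm1               ; combine two resident lanes entirely in the MMX file
    movq2dq xmm0, mm1              ; MMX -> XMM, register to register (SSE2)
    pxor    xmm0, xmm2             ; keep working in the XMM file
    movdq2q mm2, xmm0              ; XMM -> MMX when the result belongs back there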

But on Intel before Haswell, there are only 3 ALU execution ports, so the 4-wide superscalar pipeline could hit a narrower bottleneck (ALU throughput) than front-end throughput, if you leave the store/load ports idle.

(See https://agner.org/optimize/ and other performance links in the x86 tag wiki.)
