MMX Register Speed vs Stack for Unsigned Integer Storage


Problem Description


I am contemplating an implementation of SHA3 in pure assembly. SHA3 has an internal state of 17 64-bit unsigned integers, but because of the transformations it uses, the best case could be achieved if I had 44 such integers available in registers, plus possibly one scratch register. In such a case, I would be able to do the entire transform in the registers.

But this is unrealistic, and optimisation is possible all the way down to even just a few registers. Still, more is potentially better, depending on the answer to this question.

I am thinking of using the MMX registers for fast storage at least, even if I'll need to swap into other registers for computation. But I'm concerned about that being ancient architecture.

Is data transfer between an MMX register and, say, RAX going to be faster than indexing u64s on the stack and accessing them from what's likely to be L1 cache? Or even if so, are there hidden pitfalls besides considerations of speed I should watch for? I am interested in the general case, so even if one was faster than the other on my computer, it might still be inconclusive.

Solution

Using ymm registers as a "memory-like" storage location is not a win for performance, and MMX wouldn't be either. That use-case is about completely avoiding memory accesses which might disturb a micro-benchmark.

Efficient store-forwarding and fast L1d cache hits make spilling to regular memory (the stack) a very good option. x86 allows memory operands, like add eax, [rdi], and modern CPUs can decode that into a single uop.
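
For instance, a minimal sketch of the stack-based style (NASM syntax; the stack layout and lane indices here are assumptions for illustration, not taken from any particular SHA3 implementation):

    ; state lanes spilled to 64-bit slots on the stack (assumed layout)
    sub     rsp, 17*8              ; reserve space for the state lanes
    mov     rax, [rsp + 0*8]       ; load lane 0
    xor     rax, [rsp + 5*8]       ; xor with a memory operand: a single fused-domain uop
    xor     rax, [rsp + 10*8]      ; same again for another lane
    mov     [rsp + 0*8], rax       ; store back; a prompt reload hits store-forwarding (~3-5 cycles)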

With MMX you'd need 2 uops, like movd edx, mm0 / add eax, edx. So that's more uops, and more latency. movd or movq latency to/from MMX or XMM registers is worse than 3 to 5 cycle store-forwarding latency on typical modern CPUs.
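
For comparison, a hedged sketch of the same combine step when one lane is parked in an MMX register (the register assignment is hypothetical): the value has to travel through a scratch general-purpose register first, costing an extra uop plus the transfer latency.

    movq    rdx, mm0               ; MMX -> GP transfer: the extra uop and its latency
    xor     rax, rdx               ; only now can the integer xor happen
    ; versus the single  xor rax, [rsp + 5*8]  when the lane lives on the stack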


But if you don't need to move data back and forth often, you might be able to usefully keep some of your data in MMX / XMM registers and use pxor mm0, mm1 and so on.

If you can schedule your algorithm so you have fewer total instructions / uops from using movd/movq (int<->XMM or int<->MMX) and movq2dq/movdq2q (MMX->XMM / XMM->MMX) instructions instead of stores and memory operands or loads, then it might be a win.
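
A hedged sketch of that style (register assignment again hypothetical): lanes that live in the MMX/XMM register files are combined there directly, and movq2dq/movdq2q move data between the two files without touching memory at all.

    pxor    mm0, mm1               ; combine two resident lanes entirely in the MMX file
    movq2dq xmm0, mm1              ; MMX -> XMM, register to register (SSE2)
    pxor    xmm0, xmm2             ; keep working in the XMM file
    movdq2q mm2, xmm0              ; XMM -> MMX when the result belongs back there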

But on Intel before Haswell, there are only 3 ALU execution ports, so the 4-wide superscalar pipeline could hit a narrower bottleneck (ALU throughput) than front-end throughput, if you leave the store/load ports idle.

(See https://agner.org/optimize/ and other performance links in the x86 tag wiki.)
