128-bit values - From XMM registers to General Purpose
Question
I have a couple of questions related to moving XMM values to general purpose registers. All the questions found on SO focus on the opposite, namely transferring values in gp registers to XMM.
How can I move an XMM register value (128-bit) to two 64-bit general purpose registers?
movq RAX, XMM1 ; bits 0 to 63
mov? RCX, XMM1 ; bits 64 to 127
Similarly, how can I move an XMM register value (128-bit) to four 32-bit general purpose registers?
movd EAX, XMM1 ; bits 0 to 31
mov? ECX, XMM1 ; bits 32 to 63
mov? EDX, XMM1 ; bits 64 to 95
mov? ESI, XMM1 ; bits 96 to 127
Recommended answer
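With SSE2 alone, you cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may involve a round trip to memory or the destruction of a register. SSE4.1 adds pextrq / pextrd, which extract an upper element in a single instruction, but they are slower.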
In registers (SSE2)
movq rax,xmm0 ;lower 64 bits
movhlps xmm0,xmm0 ;move high 64 bits to low 64 bits.
movq rbx,xmm0 ;high 64 bits.
punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.
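For readers working in C rather than assembly, the in-register sequence above can be sketched with SSE2 intrinsics. This is only an illustration, assuming an x86-64 target and a GCC- or Clang-style compiler; the helper name `xmm_to_u64_pair` is made up for this example, and the compiler is free to choose different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: split a 128-bit value into two 64-bit halves
   without going through memory. */
static void xmm_to_u64_pair(__m128i v, uint64_t *lo, uint64_t *hi)
{
    /* Typically compiles to movq r64, xmm: the low 64 bits. */
    *lo = (uint64_t)_mm_cvtsi128_si64(v);

    /* punpckhqdq: copy the high qword into the low qword position.
       Writing to a new variable leaves the original v intact. */
    __m128i high = _mm_unpackhi_epi64(v, v);
    *hi = (uint64_t)_mm_cvtsi128_si64(high);
}
```

Because the shuffle result goes into a separate variable, the original vector is not destroyed at the source level; whether a register copy is needed is up to the compiler.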
Via memory (SSE2)
movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]
Slow, but does not destroy the xmm register (SSE4.1)
movq rax,xmm0
pextrq rbx,xmm0,1 ;3 cycle latency on Ryzen! (and 2 uops)
A hybrid strategy is possible: store to memory and also run movd/movq eax/rax,xmm0 so the low element is ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements. (mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.)
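The hybrid idea can be sketched in C with SSE2 intrinsics as follows. This is a rough illustration, assuming an x86-64 target; the helper name `xmm_to_u64_pair_hybrid` and the local buffer are hypothetical, and an optimizing compiler may well pick a different instruction mix than the one described in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: take the low half straight from the register
   and the high half from a store/reload. */
static void xmm_to_u64_pair_hybrid(__m128i v, uint64_t *lo, uint64_t *hi)
{
    uint64_t buf[2];

    /* movdqu [mem], xmm: store all 16 bytes once. */
    _mm_storeu_si128((__m128i *)buf, v);

    /* movq r64, xmm: the low half does not wait for the store. */
    *lo = (uint64_t)_mm_cvtsi128_si64(v);

    /* Plain 64-bit load of the upper half, served by store forwarding. */
    *hi = buf[1];
}
```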
For 32 bits, the code is similar:
In registers
movd eax,xmm0
psrldq xmm0,4 ;shift 4 bytes to the right
movd ebx,xmm0
psrldq xmm0,4 ; pshufd could copy-and-shuffle the original reg
movd ecx,xmm0 ; not destroying the XMM and maybe creating some ILP
psrldq xmm0,4 ; byte shift again (psrlq would shift each qword by bits)
movd edx,xmm0
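The pshufd alternative mentioned in the comments above can be sketched in C with SSE2 intrinsics. This is a minimal illustration, assuming an x86 target with SSE2; the helper name `xmm_to_u32x4` is invented for this example. Each shuffle reads the original vector, so the three extractions are independent of each other instead of forming a chain of shifts.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: extract the four 32-bit lanes in registers,
   leaving the source vector untouched. */
static void xmm_to_u32x4(__m128i v, uint32_t out[4])
{
    /* movd r32, xmm: lane 0. */
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);

    /* pshufd copies lane n into lane 0 of a temporary, then movd reads it. */
    out[1] = (uint32_t)_mm_cvtsi128_si32(
                 _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1)));
    out[2] = (uint32_t)_mm_cvtsi128_si32(
                 _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2)));
    out[3] = (uint32_t)_mm_cvtsi128_si32(
                 _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3)));
}
```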
Via memory
movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]
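In C, the store/reload route amounts to one unaligned store followed by ordinary loads. The sketch below assumes an x86 target with SSE2; the helper name `xmm_to_u32x4_via_memory` is hypothetical, and the caller's array stands in for `[mem]`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: write all four 32-bit lanes to memory at once.
   The caller then reads out[0]..out[3] with plain 32-bit loads. */
static void xmm_to_u32x4_via_memory(__m128i v, uint32_t out[4])
{
    /* movdqu [mem], xmm: no alignment requirement on out. */
    _mm_storeu_si128((__m128i *)out, v);
}
```

The source vector is left untouched, and the cost of the single store is shared by all four elements.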
Without destroying the xmm register (SSE4.1) (slower than the psrldq / pshufd version)
movd eax,xmm0
pextrd ebx,xmm0,1 ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2 ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3
The 64-bit shift variant can run in 2 cycles. The pextrq version takes at least 4. For 32-bit, the numbers are 4 and 10, respectively.