128-bit values - From XMM registers to General Purpose


Question

I have a couple of questions related to moving XMM values to general purpose registers. All the questions I found on SO focus on the opposite direction, namely transferring values from general purpose registers to XMM.

  1. How can I move an XMM register value (128 bits) into two 64-bit general purpose registers?

movq RAX, XMM1 ; bits 0 to 63
mov? RCX, XMM1 ; bits 64 to 127

  2. Similarly, how can I move an XMM register value (128 bits) into four 32-bit general purpose registers?

    movd EAX, XMM1 ; bits 0 to 31
    mov? ECX, XMM1 ; bits 32 to 63
    mov? EDX, XMM1 ; bits 64 to 95
    mov? ESI, XMM1 ; bits 96 to 127
    

Recommended answer

    With plain SSE2, you cannot move the upper bits of an XMM register into a general purpose register directly.
    You'll have to follow a two-step process, which may involve a round trip through memory or the destruction of a register. (SSE4.1 adds pextrd/pextrq, shown further down, which extract an upper element in a single instruction.)

    In registers (SSE2)

    movq rax,xmm0       ;lower 64 bits
    movhlps xmm0,xmm0   ;move high 64 bits to low 64 bits.
    movq rbx,xmm0       ;high 64 bits.
    

    punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.
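
For readers working in C rather than assembly, the same register-only split can be sketched with SSE2 intrinsics. This is a minimal sketch, assuming an x86-64 target; the helper name split_u64_shuffle is hypothetical, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: split a 128-bit vector into two 64-bit integers
   without going through memory. */
void split_u64_shuffle(__m128i v, uint64_t *lo, uint64_t *hi)
{
    /* Low half: corresponds to movq r64, xmm. */
    *lo = (uint64_t)_mm_cvtsi128_si64(v);

    /* Copy the high half into the low half: corresponds to punpckhqdq. */
    __m128i high = _mm_unpackhi_epi64(v, v);
    *hi = (uint64_t)_mm_cvtsi128_si64(high);
}
```

Because the intrinsic returns a new value instead of overwriting v, the original vector stays usable; the compiler inserts a register copy if it needs one.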

    Via memory (SSE2)

    movdqu [mem],xmm0
    mov rax,[mem]
    mov rbx,[mem+8]
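
The store/reload route looks like this in C. This is a sketch, assuming SSE2 and a hypothetical helper name split_u64_store; an optimizing compiler may keep everything in registers instead of really touching memory.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: spill the vector to a small buffer, then reload
   the two 64-bit halves with ordinary integer loads. */
void split_u64_store(__m128i v, uint64_t *lo, uint64_t *hi)
{
    uint64_t buf[2];

    /* Unaligned 16-byte store: corresponds to movdqu [mem], xmm0. */
    _mm_storeu_si128((__m128i *)buf, v);

    *lo = buf[0];  /* mov rax, [mem]   */
    *hi = buf[1];  /* mov rbx, [mem+8] */
}
```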
    

    Slow, but does not destroy the xmm register (SSE4.1)

    movq rax,xmm0
    pextrq rbx,xmm0,1        ;3 cycle latency on Ryzen! (and 2 uops)
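
The pextrq variant maps to the _mm_extract_epi64 intrinsic. The sketch below assumes GCC or Clang on x86-64: the target attribute is a compiler-specific way to enable SSE4.1 for one function (building with -msse4.1 would be the alternative), and the helper name split_u64_pextr is hypothetical.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical helper: extract both halves without modifying the vector.
   The target attribute is a GCC/Clang extension (an assumption here). */
__attribute__((target("sse4.1")))
void split_u64_pextr(__m128i v, uint64_t *lo, uint64_t *hi)
{
    *lo = (uint64_t)_mm_cvtsi128_si64(v);     /* movq rax, xmm0      */
    *hi = (uint64_t)_mm_extract_epi64(v, 1);  /* pextrq rbx, xmm0, 1 */
}
```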
    

    A hybrid strategy is possible: store the vector to memory, use movd/movq to get the low element into eax/rax so it's ready quickly, then reload the higher elements from memory. (Store-forwarding latency is not much worse than ALU latency, though.) That gives you a balance of uops across different back-end execution units. Store/reload is especially good when you want lots of small elements. (mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.)
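
The hybrid idea can be sketched in C for four 32-bit elements. This is only an illustration, assuming SSE2 and a hypothetical helper name split_u32_hybrid; whether the compiler actually emits one movd plus three loads depends on its optimizer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: take element 0 straight from the register and
   reload elements 1..3 from a stored copy of the vector. */
void split_u32_hybrid(__m128i v, uint32_t out[4])
{
    uint32_t buf[4];

    /* Store all four elements once: movdqu [mem], xmm0. */
    _mm_storeu_si128((__m128i *)buf, v);

    /* Element 0 comes directly from the register (movd eax, xmm0),
       so it does not wait for the store to forward. */
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);

    /* The remaining elements are plain 32-bit loads. */
    out[1] = buf[1];
    out[2] = buf[2];
    out[3] = buf[3];
}
```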

    For 32 bits, the code is similar:

    In registers

    movd eax,xmm0
    psrldq xmm0,4         ;shift 4 bytes to the right
    movd ebx,xmm0
    psrldq xmm0,4         ; pshufd could copy-and-shuffle the original reg
    movd ecx,xmm0         ; not destroying the XMM and maybe creating some ILP
    psrldq xmm0,4         ; byte shift again (psrldq, not psrlq)
    movd edx,xmm0
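
A C sketch of the register-only 32-bit split follows, assuming SSE2 and a hypothetical helper name split_u32_shift. Each _mm_srli_si128 shifts a fresh copy of the input, so the three shifts do not depend on one another; that is one way to get the instruction-level parallelism the comments above hint at.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pull out four 32-bit elements using byte shifts
   (psrldq) followed by movd, leaving the input vector untouched. */
void split_u32_shift(__m128i v, uint32_t out[4])
{
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));   /* shift by 4 bytes  */
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 8));   /* shift by 8 bytes  */
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 12));  /* shift by 12 bytes */
}
```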
    

    Via memory

    movdqu [mem],xmm0
    mov eax,[mem]
    mov ebx,[mem+4]
    mov ecx,[mem+8]
    mov edx,[mem+12]
    

    Without destroying the xmm register (SSE4.1) (slower than the psrldq/pshufd version)

    movd eax,xmm0
    pextrd ebx,xmm0,1        ;3 cycle latency on Skylake!
    pextrd ecx,xmm0,2        ;also 2 uops: like a shuffle(port5) + movd(port0)
    pextrd edx,xmm0,3       
    


    The 64-bit shift variant can run in 2 cycles. The pextrq version takes 4 minimum. For 32-bit, the numbers are 4 and 10, respectively.
