Is there any situation where using MOVDQU and MOVUPD is better than MOVUPS?

Question

I was trying to understand the different MOV instructions for SSE on Intel x86-64.

According to this, you should use the aligned instructions (MOVAPS, MOVAPD and MOVDQA) when moving data between two registers, picking the one that matches the type you're operating on, and use MOVUPS/MOVAPS when moving from a register to memory and vice versa, since the type does not impact performance when moving to/from memory.
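
For concreteness, a minimal sketch of what that advice amounts to, with registers and memory operands made up purely for illustration:

    ; reg-reg copies: match the move to the data's domain
    movaps  xmm2, xmm0        ; packed-single (float) data, FP domain
    movdqa  xmm3, xmm1        ; packed-integer data, integer vector domain

    ; reg-mem traffic: the claim is that the type suffix doesn't matter here,
    ; so MOVUPS (or MOVAPS when alignment is guaranteed) covers everything
    movups  xmm0, [rsi]       ; 16-byte load, no alignment requirement
    movups  [rdi], xmm0       ; 16-byte store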

So is there ever any reason to use MOVDQU and MOVUPD? Is the explanation I got from the link wrong?

Answer

Summary: I am not aware of any recent x86 architecture that incurs additional delays when using the "wrong" load instruction (i.e., a load instruction followed by an ALU instruction from the opposite domain).

Here's what Agner has to say about bypass delays, which are the delays you might incur when moving data between the various execution domains within the CPU (sometimes these are unavoidable, but sometimes they may be caused by using the "wrong" version of an instruction, which is what is at issue here):

Data bypass delays on Nehalem

On the Nehalem, the execution units are divided into five "domains":

The integer domain handles all operations in general purpose registers. The integer vector (SIMD) domain handles integer operations in vector registers. The FP domain handles floating point operations in XMM and x87 registers. The load domain handles all memory reads. The store domain handles all memory stores. There is an extra latency of 1 or 2 clock cycles when the output of an operation in one domain is used as input in another domain. These so-called bypass delays are listed in table 8.2.

There is still no extra bypass delay for using load and store instructions on the wrong type of data. For example, it can be convenient to use MOVHPS on integer data for reading or writing the upper half of an XMM register.

The emphasis in the last paragraph is mine, and it is the key part: the bypass delays didn't apply to Nehalem load and store instructions. Intuitively, this makes sense: the load and store units are dedicated to the entire core and have to make their result available in a way suitable for any execution unit (or store it in the PRF); unlike the ALU case, the same forwarding concerns aren't present.
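
As a hypothetical illustration of that distinction (these instruction pairs are mine, not Agner's): a bypass delay can appear when an ALU result crosses domains, while a load feeding either domain is exempt:

    ; ALU -> ALU across domains: a 1-2 cycle bypass delay is possible
    addps   xmm0, xmm1        ; result produced in the FP domain
    pand    xmm0, xmm2        ; consumed in the integer vector domain

    ; load -> ALU: no extra penalty for the "wrong" load mnemonic
    movups  xmm3, [rsi]       ; "float" load feeding an integer consumer
    pand    xmm3, xmm2        ; no type-based delay on the load side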

Now, nobody really cares about Nehalem any more, but in the sections for Sandy Bridge/Ivy Bridge, Haswell and Skylake you'll find a note that the domains are as discussed for Nehalem, and that there are fewer delays overall. So one could reasonably assume that the behavior where loads and stores don't suffer a delay based on the instruction type still holds.

We can also test it. I wrote a benchmark like this:

bypass_movdqa_latency:
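    ; rdi = iteration count on entry (counted down by the dec/jnz loop below)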
    sub     rsp, 120
    xor     eax, eax
    pxor    xmm1, xmm1
.top:
    movdqa  xmm0, [rsp + rax] ; 7 cycles
    pand    xmm0, xmm1        ; 1 cycle
    movq    rax, xmm0         ; 1 cycle
    dec     rdi
    jnz     .top
    add     rsp, 120
    ret

This loads a value using movdqa, does an integer-domain operation (pand) on it, and then moves the result to the general purpose register rax so it can be used as part of the address for movdqa in the next iteration. I also created 3 other benchmarks identical to the above, except with movdqa replaced by movdqu, movups and movupd.
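
For reference, one of those variants would look like the following; it is just the block above with the load mnemonic swapped, and the label name bypass_movups_latency is assumed rather than copied from uarch-bench:

bypass_movups_latency:
    sub     rsp, 120
    xor     eax, eax
    pxor    xmm1, xmm1
.top:
    movups  xmm0, [rsp + rax] ; only the load mnemonic differs
    pand    xmm0, xmm1
    movq    rax, xmm0
    dec     rdi
    jnz     .top
    add     rsp, 120
    ret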

The results on Skylake-client (i7-6700HQ with recent microcode):

** Running benchmark group Vector unit bypass latency **
                     Benchmark   Cycles
  movdqa [mem] -> pxor latency     9.00
  movdqu [mem] -> pxor latency     9.00
  movups [mem] -> pxor latency     9.00
  movupd [mem] -> pxor latency     9.00

In every case the round-trip latency was the same: 9 cycles, as expected (6 + 1 + 2 cycles for the load, pxor and movq, respectively).

All of these tests are added in uarch-bench in case you would like to run them on any other architecture (I would be interested in the results). I used the command line:

./uarch-bench.sh --test-name=vector/* --timer=libpfc
