How many cycles does it take to put data into a SIMD register?

Question

I'm a student learning the x86 and ARM architectures.

I was wondering how many cycles it takes to put multiple pieces of data into SIMD registers.

I understand that the XMM registers of x86 SSE are 128 bits wide.

Suppose I want to load 32 pieces of 8-bit data from the stack into an XMM register, using the SIMD instruction set in assembly language.

Does that take the same number of cycles as a PUSH/POP on a general-purpose register?

Or does it take 32x as long, once for each 8-bit piece of data?

Thank you for your attention and concern!

Recommended answer

Short answer:

If you're doing many repeated 128-bit loads, it's possible to achieve two 128-bit loads per clock cycle on Sandy Bridge, Ivy Bridge and Haswell, or one 128-bit load per clock cycle on Nehalem. On processors before Nehalem, the cost depends on whether you do an aligned or an unaligned load.

Long answer:

Mystical gave you the information you need by pointing to Agner Fog's Instruction Tables. But let me spell it out for you (and myself).

The instructions you want to look at are MOVDQU and MOVDQA with operands x, m128. Both load 128 bits of data into an XMM/YMM register in one operation. MOVDQA requires that the address be 16-byte aligned. MOVDQU has no such restriction.
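As a rough illustration of the two load forms, here is a minimal C sketch using the SSE2 intrinsics `_mm_load_si128` and `_mm_loadu_si128`. Compilers typically map these to MOVDQA and MOVDQU, although they are free to choose a different encoding. The function names `sum16_aligned` and `sum16_unaligned` are made up for this example; summing the bytes is only there so the loaded value is used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add up the 16 bytes held in a 128-bit register.
 * PSADBW against zero sums bytes 0-7 and bytes 8-15 into two 64-bit lanes. */
static unsigned sum_bytes(__m128i v)
{
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    return (unsigned)_mm_cvtsi128_si32(s)      /* low lane  */
         + (unsigned)_mm_extract_epi16(s, 4);  /* high lane */
}

/* One aligned 128-bit load (MOVDQA form): p must be 16-byte aligned,
 * otherwise the load may fault. */
static unsigned sum16_aligned(const uint8_t *p)
{
    return sum_bytes(_mm_load_si128((const __m128i *)p));
}

/* One unaligned 128-bit load (MOVDQU form): p may be any address. */
static unsigned sum16_unaligned(const uint8_t *p)
{
    return sum_bytes(_mm_loadu_si128((const __m128i *)p));
}
```

Either way, all 16 bytes arrive in the register with a single load instruction rather than one operation per byte.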

The two metrics you want to look at are latency and reciprocal throughput (lower is better). Two important changes happened to these metrics with Nehalem and Sandy Bridge:

  1. Intel processors before Nehalem had a higher latency and reciprocal throughput for MOVDQU. Since Nehalem, however, MOVDQU and MOVDQA have identical latency and reciprocal throughput.

  2. All Intel processors since Sandy Bridge can do two 128-bit loads at the same time. This can be seen nicely at intels-haswell-architecture. In Nehalem only port 2 can do a 128-bit load, whereas Sandy Bridge, Ivy Bridge and Haswell can do two 128-bit loads at the same time on ports 2 and 3 (which is how they do one AVX load). So the reciprocal throughput is 1 for Nehalem and 0.5 for Sandy Bridge.
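To make the arithmetic concrete for the 32 bytes in the question, the following C sketch turns a reciprocal throughput figure into a cycle estimate. It is a best-case, throughput-bound estimate only: it assumes the loads are independent and hit in cache, and it ignores latency. The helper names `loads_needed` and `load_cycles` are hypothetical.

```c
#include <stddef.h>

/* Number of 128-bit (16-byte) loads needed to cover `bytes` bytes,
 * rounded up. */
static size_t loads_needed(size_t bytes)
{
    return (bytes + 15) / 16;
}

/* Throughput-bound estimate: loads multiplied by the reciprocal
 * throughput (cycles per load) taken from the instruction tables.
 * Latency and cache misses are deliberately not modelled. */
static double load_cycles(size_t bytes, double recip_throughput)
{
    return (double)loads_needed(bytes) * recip_throughput;
}
```

With the figures above, `load_cycles(32, 1.0)` gives 2 cycles for Nehalem and `load_cycles(32, 0.5)` gives 1 cycle for Sandy Bridge, so the cost is nowhere near 32 separate operations.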

However, even though MOVDQA and MOVDQU have identical latency and reciprocal throughput on every processor since Nehalem, that does not mean they give identical performance. If the address is not 16-byte aligned, performance may drop. You can test this with the code by ScottD at Successful compilation of SSE instruction with qmake (but SSE2 is not recognized), where I got about a 4% drop. I think this is due to cases where an address crosses a cache line (e.g. the first 64 bits in one cache line and the next 64 bits in another); otherwise the performance is equal. This effectively means there has been no reason to use MOVDQA since Nehalem. The only difference is in memory alignment.
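The cache-line explanation can be checked with simple address arithmetic. The sketch below assumes a 64-byte cache line, which is not stated in the answer; `CACHE_LINE`, `crosses_cache_line` and `is_aligned16` are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache-line size in bytes */

/* Nonzero if a load of `width` bytes starting at `addr` touches two
 * cache lines, i.e. its first and last byte fall in different lines. */
static int crosses_cache_line(uintptr_t addr, size_t width)
{
    return (addr / CACHE_LINE) != ((addr + width - 1) / CACHE_LINE);
}

/* Nonzero if `addr` satisfies the 16-byte alignment MOVDQA requires. */
static int is_aligned16(uintptr_t addr)
{
    return (addr & 15u) == 0;
}
```

A 16-byte load at offset 8 is unaligned but stays inside one line, while one at offset 56 has its first 64 bits in one line and the next 64 bits in the following line. A 16-byte-aligned load can never straddle a line under this assumption, since 64 is a multiple of 16.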

I said that Haswell can do two 128-bit loads at the same time. In fact, it can do two 256-bit loads at the same time.

It turns out that with SSE, unaligned load instructions cannot be folded into another operation. Folding allows the CPU to use micro-op fusion (folding does not guarantee fusion, but without folding there is certainly no fusion). So it's not entirely accurate to say that aligned load instructions have been obsolete since Nehalem. It's more accurate to say they are obsolete with AVX (which arrived with Sandy Bridge for Intel). In practice, though, not folding probably makes little difference except in some special cases.
