Loading an xmm from GP regs

Question

Let's say you have values in rax and rdx you want to load into an xmm register.

One way would be:

movq     xmm0, rax
pinsrq   xmm0, rdx, 1

It's pretty slow though! Is there a better way?

Solution

You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake). movq+movq+punpcklqdq is also 3 uops, for the same port(s).
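
For reference, that alternative sequence is (a sketch; port notes are for Skylake):

movq        xmm0, rax        ; 1 uop, p5
movq        xmm1, rdx        ; 1 uop, p5
punpcklqdq  xmm0, xmm1       ; 1 uop, p5: xmm0 = [rax (low), rdx (high)]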

On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be worth considering for throughput if surrounding code bottlenecks on the ALU port for integer->vector, which is port 5 for recent Intel.

On Intel, pinsrq x,r,imm is 2 uops for port 5 and movq xmm,r64 is also 1 uop for port 5.

movhps xmm, [mem] can micro-fuse the load, but it still needs a port 5 ALU uop. So movq xmm0,rax / mov [rsp-8], rdx / movhps xmm0, [rsp-8] is 3 fused-domain uops, 2 of them needing port 5 on recent Intel. The store-forwarding latency makes this significantly higher latency than an insert.
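
Written out, that sequence is (a sketch; the [rsp-8] scratch slot is the one from the text, i.e. it assumes the red zone below rsp is usable):

movq     xmm0, rax           ; 1 uop, p5
mov      [rsp-8], rdx        ; 1 fused-domain store uop
movhps   xmm0, [rsp-8]       ; 1 micro-fused uop: load + p5 shuffle, with store-forwarding latency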

Storing both GP regs with store / store / movdqa (long store-forwarding stall from reading the two narrower stores with a larger load) is also 3 uops, but is the only reasonable sequence that avoids any port 5 uops. The ~15 cycles of latency is so much that Out-of-Order execution could easily have trouble hiding it.
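
A sketch of that store/store/reload sequence (using an unaligned load in case [rsp-16] isn't 16-byte aligned; with a known-aligned slot, movdqa works as described):

mov      [rsp-16], rax       ; store the low qword
mov      [rsp-8],  rdx       ; store the high qword
movdqu   xmm0, [rsp-16]      ; wide reload spanning both stores: ~15c store-forwarding stall, no port 5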


For YMM and/or narrower elements, stores + reload is more worth considering because you amortize the stall over more stores / it saves you more shuffle uops. But it still shouldn't be your go-to strategy for 32-bit elements.
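
For example, with 32-bit elements one reload can cover four stores (a sketch; the source registers here are arbitrary):

mov      [rsp-16], edi
mov      [rsp-12], esi
mov      [rsp-8],  edx
mov      [rsp-4],  ecx
movdqu   xmm0, [rsp-16]      ; one store-forwarding stall amortized over all four elements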

For narrower elements, it would be nice if there were a single-uop way of packing 2 narrow integers into a 64-bit integer register, to set up wider transfers to XMM regs. But there isn't: for packing two DWORDs into a QWORD (e.g. to save store bandwidth), shld is 1 uop on Intel SnB-family but needs one of the inputs at the top of a register. x86 has pretty weak bitfield insert/extract instructions compared to PowerPC or ARM, requiring multiple instructions per merge (other than store/reload, and store throughput of 1 per clock can easily become a bottleneck).
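
For comparison, the generic 2-uop packing sequence versus shld's constraint (a sketch; eax and edx are arbitrary example inputs):

; two dwords in eax and edx (upper halves of rax/rdx zeroed by the 32-bit writes)
shl      rdx, 32             ; move the second dword to the top
or       rax, rdx            ; rax = (edx << 32) | eax: 2 ALU uops
; shld rax, rdx, 32 would be 1 uop on SnB-family, but it shifts in rdx's *high* dword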


AVX512F can broadcast to a vector from an integer reg, and merge-masking allows single-uop inserts.

According to the spreadsheet from http://instlatx64.atw.hu/ (taking uop data from IACA), it only costs 1 port5 uop to broadcast any width of integer register to an x/y/zmm vector on Skylake-AVX512.
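
That is, for example:

vpbroadcastq  zmm0, rax      ; 1 uop, p5 on SKX (AVX512F; xmm/ymm destinations need AVX512VL)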

Agner doesn't seem to have tested integer source regs on KNL, but a similar VPBROADCASTMB2Q v,k (mask register source) is 1 uop.

With a mask register already set up: only 2 uops total:

; k1 = 0b0010

vmovq         xmm0, rax           ; 1 uop p5             ; AVX1
vpbroadcastq  xmm0{k1}, rdx       ; 1 uop p5  merge-masking

I think merge-masking is "free" even for ALU uops. Note that we do the VMOVQ first so we can avoid a longer EVEX encoding for it. But if you have 0001 in a mask reg instead of 0010, blend it into an unmasked broadcast with vmovq xmm0{k1}, rax.

With more mask registers set up, we can do 1 reg per uop:

vmovq         xmm0, rax                         2c latency
vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010     3c latency
vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100     3c latency
vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000     3c latency

(For a full ZMM vector, maybe start a 2nd dep chain and vinserti64x4 to combine 256-bit halves. Also means only 3 k registers instead of 7. It costs 1 extra shuffle uop, but unless there's some software pipelining, OoO exec might have trouble hiding the latency of 7 merges = 21c before you do anything with your vector.)

; high 256 bits: maybe better to start again with vmovq instead of continuing
vpbroadcastq  zmm0{k4}, rcx   ; k4 =0b10000     3c latency
... filling up the ZMM reg
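
A sketch of the vinserti64x4 approach for a full ZMM, with two independent merge chains (k1..k3 as set up above; the extra source registers r8..r11 are arbitrary placeholders):

vmovq         xmm0, rax
vpbroadcastq  xmm0{k1}, rdx
vpbroadcastq  ymm0{k2}, rdi
vpbroadcastq  ymm0{k3}, rsi       ; low 256 bits
vmovq         xmm1, r8
vpbroadcastq  xmm1{k1}, r9
vpbroadcastq  ymm1{k2}, r10
vpbroadcastq  ymm1{k3}, r11       ; high 256 bits, built in parallel
vinserti64x4  zmm0, zmm0, ymm1, 1 ; 1 extra shuffle uop to combine the halves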

Intel's listed latency for vpbroadcastq on SKX is still 3c even when the destination is only xmm, according to the Instlatx64 spreadsheet which quotes that and other sources. http://instlatx64.atw.hu/

The same document does list vpbroadcastq xmm,xmm as 1c latency, so presumably it's correct that we get 3c latency per step in the merging dependency chain. Merge-masking uops unfortunately need the destination register to be ready as early as other inputs; so the merging part of the operation can't forward separately.


Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

mov      eax, 0b0010        ; = 2
kmovw    k1, eax
KSHIFTLW k2, k1, 1
KSHIFTLW k3, k1, 2

#  KSHIFTLW k4, k1, 3
# ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving each mask from integer registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.
We eventually write all the elements. We could start with KXNOR k0, k0,k0 to generate a -1 and left-shift that, but that's 2 port5 uops vs. mov eax,2 / kmovw k1, eax being p0156 + p5.
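
The all-ones variant would look like this (a sketch; using masks 0b1110 / 0b1100 / 0b1000 instead of single-bit masks):

kxnorw    k1, k0, k0         ; k1 = all-ones (port 5)
kshiftlw  k2, k1, 1          ; 0b...1110: mask for merging element 1
kshiftlw  k3, k1, 2          ; 0b...1100: mask for merging element 2
kshiftlw  k4, k1, 3          ; 0b...1000: mask for merging element 3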


Without a mask register: (There's no kmov k1, imm, and loading from memory costs multiple uops, so as a one-off there's no 3-uop option using merge-masking. But in a loop if you can spare some mask regs, that appears to be far better.)

VPBROADCASTQ  xmm1, rdx           ; 1 uop  p5      ; AVX512VL (ZMM1 for just AVX512F)
vmovq         xmm0, rax           ; 1 uop p5             ; AVX1
vpblendd      xmm0, xmm0, xmm1, 0b1100    ; 1 uop p015   ; AVX2

; SKX: 3 uops:  2p5 + p015
; KNL: 3 uops: ? + ? + FP0/1

The only benefit here is that one of the 3 uops doesn't need port 5.

vmovsd xmm1, xmm1, xmm0 would also blend the two halves, but only runs on port 5 on recent Intel, unlike an integer immediate blend which runs on any vector ALU port.
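
Spelled out, that variant is (a sketch; the result ends up in xmm1 here):

vpbroadcastq  xmm1, rdx            ; both qwords = rdx
vmovq         xmm0, rax
vmovsd        xmm1, xmm1, xmm0     ; low qword from xmm0 (rax), high qword kept from xmm1 (rdx); p5 only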


More discussion about integer -> vector strategies

gcc likes to store/reload, which is not optimal on anything except in very rare port 5-bound situations where a large amount of latency doesn't matter. I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833, with more discussion of what might be optimal on various architectures for 32-bit or 64-bit elements.

I suggested the above vpbroadcastq replacement for insert with AVX512 on the first bug.

(If compiling _mm_set_epi64x, definitely use -mtune=haswell or something recent, to avoid the crappy tuning for the default mtune=generic. Or use -march=native if your binaries will only run on the local machine.)
