Packing two DWORDs into a QWORD to save store bandwidth

Question

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:

top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top

On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).

One idea that naturally arises is to combine two DWORD loads into a single QWORD store, which is possible since the stores are contiguous. Something like this could work:

top:
mov eax, DWORD [rsi]   ; zero-extends into rax
mov ebx, DWORD [rdx]   ; zero-extends into rbx
shl rbx, 32            ; move the second DWORD into the high half
or  rax, rbx           ; combine the two DWORDs into one QWORD
mov QWORD [rdi], rax   ; single QWORD store

Basically do the two loads and use two ALU ops to combine them into a single QWORD which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs, so at the 4-uops-per-cycle front-end limit that's 1.25 cycles per QWORD, or 0.625 cycles per DWORD.

Already much better than the first option, but I can't help thinking there is a better option for this shuffling. For example, we are wasting uop throughput by using plain loads: it feels like we should be able to combine at least some of the ALU ops with the loads via memory source operands, but I was mostly stymied on Intel: shl on memory only has an RMW form, and shlx and rorx don't micro-fuse.

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out the garbage in the loaded DWORD.
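
A minimal sketch of that idea (the mask register and its hoisting are my additions, not from the question): the wide load puts the wanted DWORD in the high half, but clearing the low-half garbage still costs an ALU uop, so the total stays at 5 uops per QWORD, no better than the shl/or version:

    mov  rcx, 0xFFFFFFFF00000000   ; garbage-clearing mask, set up once outside the loop
top:
    mov  eax, DWORD [rsi]          ; zero-extends into rax
    mov  rbx, QWORD [rdx-4]        ; wanted DWORD lands in the high half, garbage below it
    and  rbx, rcx                  ; clear the garbage: one ALU uop...
    or   rax, rbx                  ; ...plus the merge: still 2 ALU uops per QWORD
    mov  QWORD [rdi], rax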

I'm interested in scalar code, both for the base x86-64 instruction set and, if possible, better versions using useful extensions like BMI.

Answer

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out the garbage in the loaded DWORD.

If wider loads are ok for correctness and performance (cache-line splits...), we can use shld:

top:
    mov eax, DWORD [rsi]
    mov rbx, QWORD [rdx-4]     ; unaligned(?) 64-bit load

    shld rax, rbx, 32          ; 1 uop on Intel SnB-family, 0.5c recip throughput
    mov QWORD [rdi], rax

MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).

top:
    movd       mm0, DWORD [rsi]
    punpckldq  mm0, QWORD [rdx]     ; 1 micro-fused uop on Intel SnB-family

    movq       QWORD [rdi], mm0

 ; required after the loop, making it only worthwhile for long-running loops
 emms

punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.

You could also use the 128-bit AVX version, but that's even more likely to cross a cache-line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks, vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
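
For concreteness, a sketch of that AVX form (my illustration of what the answer describes): the three-operand encoding is non-destructive, but the memory operand is still 16 bytes wide:

top:
    vmovd       xmm0, DWORD [rsi]
    vpunpckldq  xmm0, xmm0, [rdx]   ; 16-byte memory operand: unaligned is legal with AVX,
                                    ; but it can split a cache line or fault past [rdx+15]
    vmovq       QWORD [rdi], xmm0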

Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.
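
To illustrate the movhps workaround (qword granularity only, so it doesn't help the DWORD merge in this question):

    movq    xmm0, QWORD [rsi]   ; load 8 bytes into the low half
    movhps  xmm0, QWORD [rdx]   ; 8-byte load into the high half, no alignment requirement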

To avoid CL-splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.

top:
    movd       xmm0, DWORD [rsi]

    movd       xmm1, DWORD [rdx]           ; SSE2
    punpckldq  xmm0, xmm1
    ; or pinsrd  xmm0, DWORD [rdx], 1      ; 2 uops not micro-fused

    movq       QWORD [rdi], xmm0

Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
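
A hedged sketch of what that might look like (the base register rbp and index vector ymm1 are my assumptions; note that vpgatherdd zeroes its mask register, so the all-ones mask must be regenerated every iteration):

top:
    vpcmpeqd    ymm2, ymm2, ymm2            ; all-ones mask: gather all 8 elements
    vpgatherdd  ymm0, [rbp + ymm1*4], ymm2  ; 8 DWORDs from 8 arbitrary DWORD offsets
    vmovdqu     [rdi], ymm0                 ; one contiguous 32-byte store
    add         rdi, 32
    ; update the indices in ymm1 somehow, then loop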
