SIMD versions of SHLD/SHRD instructions

Question

The SHLD/SHRD instructions are assembly instructions used to implement multi-precision shifts.

Consider the following problem:

uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);

What is the most efficient way to implement left_shift and right_shift, two functions that shift an array of four 64-bit unsigned integers as if it were one big 256-bit unsigned integer?

Is the most efficient way to do that to use the SHLD/SHRD instructions, or are there better instructions (such as SIMD versions) on modern architectures?

Answer

In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now; if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.

Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel Skylake they have a latency of 4 cycles and use 4 uops, which ties up a lot of execution units; on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means an instruction like:

SHLD RAX,RDX,cl        //4 uops, 4 cycle latency -> 1/16 cycle per bit
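To make clear what that single instruction computes, here is a minimal C model of SHLD and SHRD on 64-bit operands. The function names `shld64` and `shrd64` are made up for illustration; the count is reduced modulo 64, which matches how the hardware treats the count for 64-bit operands.

```c
#include <stdint.h>

/* Model of SHLD dst, src, cl: shift dst left and fill the vacated low bits
   with the top bits of src. src itself is left unchanged. */
static uint64_t shld64(uint64_t dst, uint64_t src, unsigned count)
{
    count &= 63;                 /* count is taken modulo 64 */
    if (count == 0)
        return dst;              /* avoids an undefined shift by 64 below */
    return (dst << count) | (src >> (64 - count));
}

/* Model of SHRD dst, src, cl: shift dst right and fill the vacated high bits
   with the low bits of src. */
static uint64_t shrd64(uint64_t dst, uint64_t src, unsigned count)
{
    count &= 63;
    if (count == 0)
        return dst;
    return (dst >> count) | (src << (64 - count));
}
```

Chaining these across neighbouring limbs is how a multi-precision shift of fewer than 64 bits is normally built.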

Emulating this with a shift, a rotate, a mask and an OR is not faster; it actually turns out slower.

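@Init:
MOV R15,-1
SHL R15,cl        //high bits set, low cl bits clear
NOT R15           //mask of the low cl bits, for later use
@Work:
SHL RAX,cl        //3 uops, 2 cycle latency
ROL RDX,cl        //3 uops, 2 cycle latency; top cl bits of RDX move to the bottom
AND RDX,R15       //1 uop, 0.25 cycle reciprocal throughput; keep only those bits
OR RAX,RDX        //1 uop, 0.25 cycle reciprocal throughput
//Note: unlike SHLD, this clobbers RDX.
//Still needs unrolling to achieve least amount of slowness.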
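The rotate-and-mask idea can be checked against the plain funnel-shift definition with a small C sketch. The names `rotl64`, `shld_emulated` and `shld_reference` are hypothetical helpers, and the mask is assumed to select the low `n` bits, which is what the rotate leaves the carried-in bits in.

```c
#include <stdint.h>

/* Rotate left by n bits (n is reduced modulo 64). */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* SHLD emulated as shift + rotate + mask + OR. */
static uint64_t shld_emulated(uint64_t dst, uint64_t src, unsigned n)
{
    n &= 63;
    uint64_t low_mask = ~(UINT64_MAX << n);   /* low n bits set, 0 when n == 0 */
    return (dst << n) | (rotl64(src, n) & low_mask);
}

/* The direct definition, used only for comparison. */
static uint64_t shld_reference(uint64_t dst, uint64_t src, unsigned n)
{
    n &= 63;
    return n ? (dst << n) | (src >> (64 - n)) : dst;
}
```

Both functions agree for every count from 0 to 63, so the emulation is correct; it just costs more uops than the single SHLD.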

Note that SHLD only shifts 64 bits at a time, because RDX (the source) is not modified.
So you're trying to beat 4 cycles per 64 bits.
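For reference, the scalar baseline for the full problem in the question might look like the following portable C sketch. It assumes `array[0]` is the least significant limb (the question does not say which order is used) and splits the count into a whole-limb move plus a sub-limb shift, so a count of 172 becomes 2 limbs plus 44 bits.

```c
#include <stdint.h>

#define LIMBS 4

/* Shift a 256-bit value left by n bits; n >= 256 yields zero.
   Assumption: a[0] is the least significant limb. */
static void left_shift(uint64_t a[LIMBS], unsigned n)
{
    unsigned words = n / 64, bits = n % 64;
    for (unsigned i = LIMBS; i-- > 0; ) {        /* high to low, safe in place */
        uint64_t kept = 0, carried = 0;
        if (i >= words) {
            kept = a[i - words] << bits;
            if (bits != 0 && i >= words + 1)
                carried = a[i - words - 1] >> (64 - bits);
        }
        a[i] = kept | carried;
    }
}

/* Shift a 256-bit value right by n bits; n >= 256 yields zero. */
static void right_shift(uint64_t a[LIMBS], unsigned n)
{
    unsigned words = n / 64, bits = n % 64;
    for (unsigned i = 0; i < LIMBS; ++i) {       /* low to high, safe in place */
        uint64_t kept = 0, carried = 0;
        if (i + words < LIMBS) {
            kept = a[i + words] >> bits;
            if (bits != 0 && i + words + 1 < LIMBS)
                carried = a[i + words + 1] << (64 - bits);
        }
        a[i] = kept | carried;
    }
}
```

A compiler may or may not turn the inner `kept | carried` expression into SHLD/SHRD; this is only meant to pin down the behaviour that any faster version has to reproduce.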

//AVX2: 4*64 bits parallel shift.
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3    //1 uop, 0.5 cycle reciprocal throughput

However, if you want it to do exactly what SHLD does, you'll need an extra VPSRLVQ and an OR to combine the two results.

VPSLLVQ YMM1, YMM2, YMM3    //1 uop, 0.5 cycle reciprocal throughput
VPSRLVQ YMM5, YMM2, YMM4    //1 uop, 0.5 cycle reciprocal throughput
VPOR    YMM1, YMM1, YMM5    //1 uop, 0.33 cycle reciprocal throughput
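A portable scalar model of that three-instruction pattern may help show how the lanes combine. This is not intrinsics code: `lane_shl` and `lane_shr` are hypothetical helpers that mimic one 64-bit lane of VPSLLVQ and VPSRLVQ, including the rule that a count above 63 produces zero. The `below` array is an added assumption: to carry bits across limbs the right shift has to read each lane's lower neighbour, which takes an extra lane permute that the listing above does not show.

```c
#include <stdint.h>

/* One 64-bit lane of VPSLLVQ / VPSRLVQ: counts above 63 give 0. */
static uint64_t lane_shl(uint64_t x, uint64_t n) { return n < 64 ? x << n : 0; }
static uint64_t lane_shr(uint64_t x, uint64_t n) { return n < 64 ? x >> n : 0; }

/* Scalar model of a 256-bit left shift by n bits, 0 <= n < 64.
   in[0] is assumed to be the least significant lane. */
static void shift_left_256_model(uint64_t out[4], const uint64_t in[4], unsigned n)
{
    /* Each lane's lower neighbour; lane 0 has none, so zeros are shifted in. */
    const uint64_t below[4] = { 0, in[0], in[1], in[2] };
    for (int i = 0; i < 4; ++i) {
        uint64_t kept    = lane_shl(in[i], n);           /* VPSLLVQ */
        uint64_t carried = lane_shr(below[i], 64u - n);  /* VPSRLVQ */
        out[i] = kept | carried;                         /* VPOR */
    }
}
```

Because a lane count of 64 yields zero, the n == 0 case needs no special handling here, unlike the scalar C shift operators.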

You'll need to interleave 4 sets of these, costing you (3*4)+2=14 YMM registers.
Doing so, I doubt you'll profit from the low 0.33 cycle throughput of VPOR, so I'll assume 0.5 cycles instead.
That makes 3 uops and 1.5 cycles for 256 bits = 1/171 cycle per bit = 0.37 cycle per QWord = 10x faster, not bad.
If you are able to get 1.33 cycles per 256 bits = 1/192 cycle per bit = 0.33 cycle per QWord = 12x faster.

'It's the Memory, Stupid!'
Obviously I've not added in loop overhead or the loads and stores to and from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles (see note 1).
It is in clever management of memory that the major gains will be made.
The possible 12x speed-up from 256-bit AVX is small potatoes in comparison.

I'm not counting the setup of the shift count in CL (or YMM3/YMM4), because I'm assuming you'll reuse that value over many iterations.

You're not going to beat that with AVX512 instructions, because consumer-grade CPUs with AVX512 are not yet available.
The only processor that currently supports it is Knights Landing.

*) All these timings are best-case values and should be taken as indications, not as hard values.
1) Cost of a cache miss on Skylake: 42 cycles + 52 ns = 42 + (52 * 4.6 GHz) = 281 cycles.
