What is an efficient way to load an x64 ymm register with 4 separated doubles?
Problem description
What is the most efficient way to load an x64 ymm register with

- 4 doubles evenly spaced, i.e. taken from a contiguous set of doubles
  0 1 2 3 4 5 6 7 8 9 10 .. 100
  where I want to load, for example, 0, 10, 20, 30
- 4 doubles at any position,
  i.e. I want to load, for example, 1, 6, 22, 43
Solution
The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later.
VGATHERQPD ymm1, [rsi+ymm7*8], ymm2
Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
This achieves the load with one instruction. Here ymm2 is the mask register: the highest bit of each element indicates whether the corresponding value is copied to ymm1 or left unchanged. ymm7 contains the element indices, which the scale factor (8, the size of a double) turns into byte offsets. Applied to your examples, it could look like this in MASM syntax:
4 doubles evenly spaced, i.e. a contiguous set of doubles
0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load, for example, 0, 10, 20, 30
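    .data
    align 16
    qqIndices dq 0,10,20,30
    dpValues REAL8 0,1,2,3, ... 100
    .code
    lea rsi, dpValues
    vmovdqu ymm7, ymmword ptr qqIndices   ; four qword indices (unaligned load: data is only 16-byte aligned)
    vpcmpeqw ymm1, ymm1, ymm1             ; mask: set to all ones
    vgatherqpd ymm0, [rsi+ymm7*8], ymm1   ; gather; the mask register is zeroed afterwards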
Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. Another thing to mention is that this is not necessarily the fastest choice in every scenario. The values are all gathered separately, which means each value needs its own memory access; see "How are the gather instructions in AVX2 implemented". According to Mysticial's comment (http://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented/21778409#comment56126627_21778409):
I recently had to do something that required a true gather-load (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.
So the fastest way would be to use that approach.
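The sequence from the comment (scalar index loads, two movsd, two movhpd, one vinsertf128) could be sketched with C intrinsics roughly as follows. This is only an illustration, not the commenter's code: the function names load4_manual and load4_manual_avx are hypothetical, the GCC/Clang target attribute and __builtin_cpu_supports are assumed to be available, and the result is stored to memory only so that it can be inspected. In real code the __m256d value would simply stay in a register.

```c
#include <immintrin.h>

/* Hypothetical helper: builds { base[idx[0]], ..., base[idx[3]] } from
   four scalar loads and one 128-bit insert, then stores it to out. */
__attribute__((target("avx")))
static void load4_manual_avx(const double *base, const long long idx[4],
                             double out[4])
{
    __m128d lo = _mm_load_sd(base + idx[0]);   /* movsd:  element 0 */
    lo = _mm_loadh_pd(lo, base + idx[1]);      /* movhpd: element 1 */
    __m128d hi = _mm_load_sd(base + idx[2]);   /* movsd:  element 2 */
    hi = _mm_loadh_pd(hi, base + idx[3]);      /* movhpd: element 3 */

    /* vinsertf128: place hi into the upper 128 bits above lo. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: uses the AVX path when the CPU supports it and
   falls back to plain scalar copies otherwise. */
void load4_manual(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx")) {
        load4_manual_avx(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

Calling load4_manual(data, idx, out) with idx = {1, 6, 22, 43} covers the "any position" case; the evenly spaced case only changes how the four indices are computed.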
So "efficient" in terms of opcode count means using VGATHER, while "efficient" in terms of execution time means the last approach (so far; let's see how future architectures perform).
EDIT: according to the comments, the VGATHER instructions get faster on Broadwell and Skylake.
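For comparison, the gather variant can be sketched in C for both cases in the question. The intrinsic _mm256_i64gather_pd maps to VGATHERQPD with an all-ones mask. The function names below (gather4_avx2, load4_gather, load4_gather_strided) are hypothetical, the GCC/Clang target attribute and __builtin_cpu_supports are assumed, and the store to out is there only to make the result observable.

```c
#include <immintrin.h>

/* Hypothetical helper: one VGATHERQPD with four qword indices. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const long long idx[4],
                         double out[4])
{
    /* Load the four 64-bit indices (no alignment requirement). */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double): address = base + index * 8. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}

/* Arbitrary positions: out[i] = base[idx[i]]. Falls back to scalar
   loads when the CPU does not report AVX2 support. */
void load4_gather(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}

/* Evenly spaced positions: out[i] = base[start + i * stride]. */
void load4_gather_strided(const double *base, long long start,
                          long long stride, double out[4])
{
    const long long idx[4] = {
        start, start + stride, start + 2 * stride, start + 3 * stride
    };
    load4_gather(base, idx, out);
}
```

With an array holding 0 .. 100, load4_gather_strided(data, 0, 10, out) corresponds to the 0, 10, 20, 30 example and load4_gather with indices {1, 6, 22, 43} to the second one. Which variant is faster depends on the microarchitecture, as noted above, so it is worth measuring both.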