What is an efficient way to load an x64 ymm register with 4 separate doubles?


Problem description

What is the most efficient way to load an x64 ymm register with

  1. 4 doubles evenly spaced, i.e. taken from a contiguous set of doubles

    0  1  2  3  4  5  6  7  8  9 10 .. 100
    and I want to load for example 0, 10, 20, 30

  2. 4 doubles at any position

    i.e. I want to load for example 1, 6, 22, 43

Solution

The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later, which can achieve this with one instruction.

VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

Here ymm2 is the mask register: the highest bit of each 64-bit element indicates whether the corresponding value is loaded into ymm1 or left unchanged. ymm7 contains the indices of the elements; each index is multiplied by the scale factor (8, the size of a double) and added to the base address in rsi. The destination, index and mask registers must all be different, and the instruction clears the mask register when it completes.
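The mask and merge behaviour described above can be illustrated with a scalar model. This is only a sketch of the semantics, not how the hardware implements the instruction; the function name vgatherqpd_model and the use of plain arrays to stand in for the three ymm registers are assumptions made for illustration.

```c
#include <stdint.h>

/* Scalar model of: VGATHERQPD ymm_dst, [base + ymm_idx*8], ymm_mask
   dst, idx and mask each stand in for one ymm register (4 qword lanes).
   Hypothetical helper, for illustration only. */
void vgatherqpd_model(double dst[4], const double *base,
                      const int64_t idx[4], uint64_t mask[4])
{
    for (int i = 0; i < 4; ++i) {
        if (mask[i] >> 63)            /* only the top bit of each lane is tested */
            dst[i] = base[idx[i]];    /* address = base + idx[i] * 8 bytes */
        /* otherwise dst[i] keeps its previous value (merge behaviour) */
        mask[i] = 0;                  /* the mask register ends up zeroed */
    }
}
```

A lane whose mask element has a clear top bit is skipped even if its lower bits are set, which is why the example further down fills the mask with all ones before gathering.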

Applied to your first example, it could look like this in MASM syntax:

4 doubles evenly spaced, i.e. taken from a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load for example 0, 10, 20, 30

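.data
  align 16
  qqIndices dq 0,10,20,30                 ; four qword element indices
  dpValues  REAL8 0.0,1.0,2.0,3.0, ... 100.0
.code
  lea rsi, dpValues                       ; base address of the doubles
  vmovdqu ymm7, ymmword ptr qqIndices     ; unaligned load of the four indices
  vpcmpeqd ymm1, ymm1, ymm1               ; mask: set to all ones
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; ymm0 = dpValues[qqIndices[0..3]]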

Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. For the second case (for example 1, 6, 22, 43) only the index table changes. Note that this is not necessarily the fastest choice in every scenario: the values are all gathered separately, so each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
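For reference, the same gather can be written with compiler intrinsics instead of MASM. The following is a minimal sketch, assuming GCC or Clang on x86-64; the names gather4, gather4_avx2 and gather4_scalar are hypothetical and not part of the original answer. The target attribute and the __builtin_cpu_supports check are compiler-specific conveniences so that the file builds without -mavx2 and still runs on CPUs without AVX2.

```c
#include <immintrin.h>

/* Gather base[idx[0..3]] into out[0..3] using the VGATHERQPD intrinsic.
   Requires AVX2 at run time. */
__attribute__((target("avx2")))
void gather4_avx2(const double *base, const long long idx[4], double out[4])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx); /* 4 qword indices */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);          /* scale = sizeof(double) */
    _mm256_storeu_pd(out, v);
}

/* Portable fallback that produces the same result. */
void gather4_scalar(const double *base, const long long idx[4], double out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}

/* Hypothetical dispatcher: use the gather only if the CPU reports AVX2. */
void gather4(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx2"))
        gather4_avx2(base, idx, out);
    else
        gather4_scalar(base, idx, out);
}
```

Calling gather4 with the index table {0, 10, 20, 30} covers the evenly spaced case, and {1, 6, 22, 43} covers the arbitrary-position case; the code path is identical, only the indices differ.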

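According to Mysticial's comment (http://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented/21778409#comment56126627_21778409):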

I recently had to do something that required a true gather-load. (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.

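So on Haswell the fastest way would be that approach: load each element separately with movsd/movhpd and combine the two halves with vinsertf128.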
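The scalar-load sequence from the comment might look roughly like this with intrinsics. This is a sketch under assumptions: GCC or Clang on x86-64, hypothetical function names (load4, load4_insert_avx, load4_scalar), and no guarantee that the compiler emits exactly the movsd / movhpd / vinsertf128 sequence. It has not been benchmarked here.

```c
#include <immintrin.h>

/* Build four doubles from independent scalar loads, mirroring
   2x movsd + 2x movhpd + vinsertf128. Requires AVX at run time. */
__attribute__((target("avx")))
void load4_insert_avx(const double *base, const long long idx[4], double out[4])
{
    __m128d lo = _mm_load_sd(&base[idx[0]]);   /* movsd:  element 0, upper half zero */
    lo = _mm_loadh_pd(lo, &base[idx[1]]);      /* movhpd: element 1 into upper half */
    __m128d hi = _mm_load_sd(&base[idx[2]]);   /* movsd:  element 2 */
    hi = _mm_loadh_pd(hi, &base[idx[3]]);      /* movhpd: element 3 */
    /* vinsertf128: place hi into the upper 128 bits of the ymm value */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}

/* Portable fallback that produces the same result. */
void load4_scalar(const double *base, const long long idx[4], double out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}

/* Hypothetical dispatcher: use the AVX path only if the CPU reports AVX. */
void load4(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx"))
        load4_insert_avx(base, idx, out);
    else
        load4_scalar(base, idx, out);
}
```

Unlike the gather, this variant needs only AVX, not AVX2, and the four loads are independent of one another.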

So "efficient" in terms of instruction count would mean using VGATHER, while "efficient" in terms of execution time would mean the scalar-load sequence (so far; let's see how future architectures perform).

EDIT: according to comments the VGATHER instructions get faster on Broadwell and Skylake.
