What is an efficient way to load an x64 ymm register with 4 separate doubles?


Question


What is the most efficient way to load an x64 ymm register with

  1. 4 evenly spaced doubles, i.e. a strided selection from a contiguous set of doubles

    0  1  2  3  4  5  6  7  8  9 10 .. 100
    and I want to load, for example, 0, 10, 20, 30


  2. 4 doubles at arbitrary positions

    i.e. I want to load, for example, 1, 6, 22, 43
    

Solution

The simplest approach is VGATHERQPD, which is an AVX2 instruction available on Haswell and later.

VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

which can achieve this with one instruction. Here ymm2 is the mask register: the most significant bit of each 64-bit element determines whether the corresponding value is gathered into ymm1 or that destination element is left unchanged. ymm7 contains the element indices, which are multiplied by the scale factor 8 (the size of a double) to form the byte offsets. Note that the instruction clears the mask register when it completes, so the mask has to be set again before the next gather.

So applied to your examples, it could look like this in MASM syntax:

4 evenly spaced doubles, i.e. a strided selection from a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load, for example, 0, 10, 20, 30

.data
  align 16
  qqIndices dq 0,10,20,30                 ; qword element indices
  dpValues  REAL8 0.0,1.0,2.0,3.0, ... 100.0
.code
  lea rsi, dpValues
  vmovdqu ymm7, ymmword ptr qqIndices     ; load the four 64-bit indices
  vpcmpeqw ymm1, ymm1, ymm1               ; set mask to all ones (gather all four elements)
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; also clears the mask in ymm1
  

Now ymm0 contains the four doubles 0, 10, 20, 30 (though I haven't tested this yet). Another thing to mention is that this is not necessarily the fastest choice in every scenario: the values are all gathered separately, i.e. each value needs its own memory access; see How are the gather instructions in AVX2 implemented?
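For reference, the same gather can also be expressed in C with AVX2 intrinsics instead of assembly. The following is a minimal, untested sketch: the dpValues array and the indices are the ones from the example above, and _mm256_i64gather_pd is the intrinsic corresponding to VGATHERQPD (this mask-less form gathers all four elements):

    /* compile with e.g.: gcc -O2 -mavx2 gather.c */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* dpValues[i] = i, as in the MASM example above */
        double dpValues[101];
        for (int i = 0; i <= 100; ++i)
            dpValues[i] = (double)i;

        /* the four 64-bit element indices 0, 10, 20, 30 (highest element first) */
        __m256i qqIndices = _mm256_set_epi64x(30, 20, 10, 0);

        /* maps to VGATHERQPD: scale 8 = sizeof(double); this intrinsic form
           takes no explicit mask, so all four elements are gathered */
        __m256d v = _mm256_i64gather_pd(dpValues, qqIndices, 8);

        double out[4];
        _mm256_storeu_pd(out, v);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* 0 10 20 30 */
        return 0;
    }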

So according to Mysticial's comment

I recently had to do something that required a true gather-load. (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.

the fastest way would be to use that manual approach instead.
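A rough intrinsics sketch of the load sequence described in that comment (4 index loads + 2x movsd + 2x movhpd + vinsertf128) could look like the following; it is untested, and the function name gather4_manual and its parameters are only illustrative here:

    #include <immintrin.h>

    /* Gather data[idx[0]..idx[3]] into one ymm value without VGATHERQPD:
       two movsd/movhpd pairs build the two 128-bit halves, vinsertf128 joins them. */
    __m256d gather4_manual(const double *data, const long long idx[4])
    {
        __m128d lo = _mm_load_sd(&data[idx[0]]);            /* movsd  -> element 0 */
        lo = _mm_loadh_pd(lo, &data[idx[1]]);               /* movhpd -> element 1 */
        __m128d hi = _mm_load_sd(&data[idx[2]]);            /* movsd  -> element 2 */
        hi = _mm_loadh_pd(hi, &data[idx[3]]);               /* movhpd -> element 3 */
        return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);  /* vinsertf128 */
    }

With optimizations enabled, a compiler should turn this into essentially the movsd/movhpd/vinsertf128 sequence quoted above.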

So "efficient" in an OpCode way would be using VGATHER and "efficient" relating to execution time would be the last one (so far, let's see how future architectures will perform).

EDIT: According to the comments, the VGATHER instructions have become faster on Broadwell and Skylake.
